Abstract

We present a two-armed bandit model of decision making under uncertainty where the 
expected return to investing in the “risky arm” increases when choosing that arm and decreases 
when choosing the “safe” arm. These dynamics are natural in applications such as human capital 
development, job search, and occupational choice. Using new insights from stochastic control, 
along with a monotonicity condition on the payoff dynamics, we show that optimal strategies 
in our model are stopping rules that can be characterized by an index which formally coincides 
with Gittins’ index. Our result implies the indexability of a new class of restless bandit models. 
1 Introduction.


Bandit models are decision problems where, at each instant of time, a resource like time, effort, 
or money has to be allocated strategically between several options, referred to as the arms of the 
bandit. When selected, the arms yield payoffs that typically depend on unknown parameters. Arms 
that are not selected remain unchanged and yield no payoff. The key idea in this class of models is 
that agents face a tradeoff between experimentation (gathering information on the returns to each 
arm) and exploitation (choosing the arm with the highest expected value). 

Over the past sixty years, bandit models have become an important framework in economic
theory, applied mathematics and probability, and operations research. They have been used to
analyze problems as diverse as market pricing, the optimal design of clinical trials, product search,
and the research and development activities of firms (Rothschild [66], Berry and Fristedt [7], Bolton
and Harris [8], and Keller and Rady [31]). To understand how firms set prices without a clear
understanding of their demand curves, Rothschild [66] posits that firms repeatedly charge prices and
observe the resulting demand. Setting prices too high or too low is costly for firms (experimentation),
but allows them to learn about the optimal price (exploitation). In the optimal design of
clinical trials, Berry and Fristedt [7] formulate the problem as follows: given a fixed research budget,
how does one allocate effort among competing projects whose properties are only partially known at
a given point in time but may be better understood as time passes? In product search, customers
sample products to learn about their quality. Their optimizing behavior can be described as in
Bolton and Harris [8]. In these models, news about the quality of the product arrives continuously.
The situation where news arrives only occasionally, e.g. in the form of breakthroughs in research,
is modeled by Keller et al. [32, 31].

An important assumption in the classical bandit literature is that the reward distribution of
arms that are not chosen does not evolve; they rest (Gittins, Glazebrook, and Weber). This
assumption seems natural in many applications. Yet, in many other important scenarios, it seems
overly restrictive. Consider, for instance, the possibility of dynamic complementarities in human
capital production. Imagine a student who has the choice of whether or not to invest effort into
her school work. Today’s effort is rewarded by being more at ease with tomorrow’s course work, or
the ability to glean a deeper understanding from class lectures. As Cunha and Heckman note,
“learning begets learning.” Conversely, not doing one’s assignments today might give instantaneous
gratification, but makes tomorrow’s school work harder. More generally, this dynamic can be found
in the context of human capital formation when early investments in human capital increase the
expected payoff of future investments, while a lack of early investments has the reverse effect. These
dynamics require arms that evolve even when they are not used.

As a second example, consider an unemployed worker looking for a job. With every job
application, she gathers both information about the job market and experience in the application
process, which typically increases her chances of successful future job applications. Conversely, not
actively searching for a job may decrease the probability of finding a job in future applications.
This is empirically well-documented (Kroft, Lange, and Notowidigdo [36]) and could be due to
market penalties for unemployment spells, being disconnected from the changing characteristics of
the job market and the application process, or being perceived by potential employers as a signal
of low motivation.

^The importance of relaxing this assumption has been recognized early on in the seminal work of Whittle [79],
who proposed clinical trials, aircraft surveillance, and assignment of workers to tasks as potential applications.

^Cunha et al. make a similar argument in a different context.


Bandits whose inactive arms are allowed to evolve are known as restless bandits. Generally,
optimal strategies for restless bandits are unknown. Nevertheless, when a certain indexability
condition is met, Whittle’s index [79] can lead to approximately optimal solutions (Weber and
Weiss). This index plays the same fundamental role for restless bandits that Gittins’ index
[21] has for classical ones: it decomposes the task of solving multi-armed bandits into multiple
tasks of solving bandits with one safe and one risky arm. The safe arm yields constant rewards
and can be interpreted as a cost of investment in the risky arm. Deriving conditions that identify
general classes of indexable restless bandit models is an important contribution, since it permits a more
complete analysis of decision problems in which choices jointly affect instantaneous payoffs as well
as the distribution of those payoffs in the future. It is the subject of this paper.


The origins of this work are the classical bandit models of Bolton and Harris [8], Keller and Rady
[31], and Cohen and Solan [11], which we extend to the restless case. In these works, the reward
from the risky arm is Brownian motion, a Poisson process, or a Levy process. The unobserved
quantity is a Bernoulli variable. Our model is an extension of these models containing them as
special cases. Namely, we allow the same generality of reward processes with both volatility and
jumps, but make the reward distribution dependent on the type of the agent and the history of 
past investments. The latter dependence is mediated by a real valued variable that increases while 
the agent invests in the risky arm and decreases otherwise. In line with our motivating examples 
of human capital formation and job search, we call this variable the agent’s human capital. 

The bandit model is first formulated as a problem of stochastic optimal control under partial
observations in continuous time. Standard formulations of the control problem with partial
observations do not work for restless bandit models (see Section 2.2 for a discussion). However, we show
that the frameworks of Fleming and Nisio [19], Wonham [80], and Kohlmann can be used and
extended to general controlled Markov processes. We describe these issues in detail in Section 2.2
since they are rarely discussed in the context of bandit problems.

The first result in this paper is a separation theorem (Theorem 1) that establishes the equivalence
of the control problem with partial observations to a control problem with full observations
called the separated control problem. This equivalence is crucial for the solution of the problem and
is implicitly used in many works, including Bolton and Harris [8], Keller and Rady [31], and Cohen
and Solan [11]. The separated problem is derived from the partially observable one by replacing the
unobserved quantity by its filter, which is its conditional distribution given the past observations.
Put differently, the filter is the belief of the agent about the hidden state variable. In the separated
problem, admissibility of controls is defined without the strong measurability constraints present
in the control problem with partial observations. Therefore, standard results about the existence
of optimal controls and the equivalence to dynamic and linear programming can be applied.

^Bandits where the active and passive action have opposite effects on payoffs are called bi-directional bandits
(Glazebrook, Kirkbride, and Ruiz-Hernandez [23]), and our model falls into this class.

^Numerical solutions can be obtained by (possibly approximate) dynamic programming or a linear programming
reformulation of the problem (Kushner and Dupuis, Powell, and Nino-Mora).

^However, some of these works focus on strategic equilibria involving multiple agents, whereas we only treat the
single agent case.

^Modeling time as continuous allows one to treat discrete-time models with varying step sizes in a unified
framework. We show in Theorem 1 that discrete-time versions of the model converge to the continuous-time limit. This is
not true in some other and recent approaches (see Remarks 1 and 3).

Our second, and main, result (Theorem 2) is the optimality of stopping rules, meaning that it
is always better to invest first in the risky arm and then in the safe arm instead of the other way
round. This result hinges on the monotonic dependence of payoffs on past investment. Intuitively,
the sequence of investments matters for two reasons. First, investments in the risky arm reveal
information about the distribution of future rewards. The sooner this information becomes available,
the better. Second, early investments in the safe arm deteriorate the rewards of later investments
in the risky arm. By contrast, early investments in the risky arm do not make the safe arm any
less profitable.

We present an unconventional approach to show the optimality of stopping rules. The workhorse
of most of the bandit literature is either the Hamilton-Jacobi-Bellman (HJB) equation or
a setup using time changes. The inclusion of human capital as a state variable turns the HJB 
equation into a second order partial differential-difference equation. It seems unlikely that explicit 
solutions of this equation can be found. Moreover, the approach using time changes is not well 
adapted to the new dynamics of our model. We circumvent these difficulties by investigating the 
sample paths of optimal strategies. More specifically, we discretize the problem in time and show 
that any optimal strategy can be modified such that the agent never invests after a period of not 
investing and such that the modified strategy is still optimal. This interchange argument was
originally developed by Berry and Fristedt [7] for classical bandits. It turns out that the monotonic
dependence of the payoffs on the amount of past investment is exactly what is needed to generalize 
the argument to restless bandits. 

Once the optimality of stopping rules is established, it follows easily that optimal strategies 
can be characterized by an index rule. Formally, the index is the same as the one proposed in the 
celebrated result by Gittins [21] on classical bandits, but inactive arms are allowed to evolve. The
explicit formula for the index yields comparative statics of optimal strategies with respect to the 
parameters of the model. Most importantly, subsidies of the safe arm enlarge the set of states where 
the safe arm is optimal, which means that our bandit model is indexable in the sense of Whittle
[79] (see Proposition 2). More generally, any arm of a multi-armed restless bandit that satisfies
our monotonicity condition is indexable. To our knowledge, this is the first time that a sufficient
condition for indexability of a general class of restless bandits with continuous state space and a
corresponding rich class of reward processes has been formulated.


^Some sensor management models are indexable and have a continuous state space after their transformation 
To explain the structure of optimal strategies, we consider how information is processed by
agents in our model. We work in a Bayesian setting where the agent has a prior about being either 
“high” or “low type.” Rewards obtained from the risky arm depend on this type and are used by 
the agent to form a posterior belief. The current levels of belief and human capital determine at 
each stage whether it is optimal to invest in the risky or safe arm. Namely, there is a curve in the 
belief-human capital domain such that it is optimal to invest in the risky arm if the current level 
of belief and human capital lies to the right and above the curve. Otherwise, it is optimal to invest
in the safe arm. The curve is called the decision frontier (see Proposition 1).
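
To make the belief update concrete, the following minimal sketch applies Bayes' rule to a single observed reward. It is an illustration under stated assumptions, not the model of this paper: rewards are taken to be Gaussian with a type-dependent mean, and the names `update_belief`, `mean_high`, `mean_low`, and `sigma` are hypothetical. In the model itself the reward distribution also depends on human capital, which in this sketch would enter through the two means.

```python
import math


def update_belief(prior, reward, mean_high, mean_low, sigma):
    """Posterior probability of the high type after one observed reward.

    Assumption (not from the paper): the reward is Gaussian with standard
    deviation sigma and mean mean_high or mean_low depending on the type.
    """
    def likelihood(mean):
        # Gaussian density up to a constant that cancels in Bayes' rule.
        return math.exp(-0.5 * ((reward - mean) / sigma) ** 2)

    weight_high = prior * likelihood(mean_high)
    weight_low = (1.0 - prior) * likelihood(mean_low)
    return weight_high / (weight_high + weight_low)
```

With means 1 and -1, a reward above the midpoint 0 raises the belief in the high type, a reward below it lowers the belief, and a reward exactly at the midpoint leaves the belief unchanged.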

There is, however, an important, and potentially empirically relevant, difference from classical
bandit models: not only is the safe arm absorbing, it is depreciating; agents drift further and further
away from the frontier. Empirically, this implies that there are very few “marginal” agents, i.e.,
agents at the decision frontier. Programs (e.g. lower class size, school choice, financial incentives)
designed to increase student achievement at the margin are likely to be ineffective unless: (a) they
are initiated when students get close to the decision frontier, or (b) they force inframarginal students
to invest in the risky arm (e.g. some charter schools, see Dobbie and Fryer). Consistent with
Cunha et al., our model predicts that, on average, the longer society waits to invest, the more
aggressive the investment needs to be. This is in stark contrast to classical bandit models, where
agents accumulate at or near the frontier (in the sense of Proposition 5), and is one of the key
motivations of our model.

The paper is structured as follows. Section 2 provides a brief review of the bandit literature
in economics and applied mathematics. Section 3 contains the definitions of the control problems
and the separation theorem. Section 4 specializes the general framework of the previous section
to restless bandit models satisfying the monotonicity condition and analyzes the structure of
optimal strategies. Finally, Section 5 concludes.


2 Previous literature. 


2.1 Bandit models. 




Originally developed by Robbins, bandit models have been used to analyze a wide range of
economic and applied math problems. The first paper where a bandit model was used in an
economic context is Rothschild [66], in which a single firm facing a market with unknown demand
has to determine optimal prices. Subsequent applications of bandit models include partner search,
effort allocation in research, clinical trials, network scheduling, and voting in repeated elections
(McCall and McCall, Weitzman [77], Berry and Fristedt [7], Li and Neely [45], and Banks and
Sundaram).


to fully observed Markov decision problems (Washburn). This is, however, not the case in their formulation as
control problems with partial observations.

^Basu, Bose, and Ghosh, Bergemann and Valimaki, and Mahajan and Teneketzis provide excellent
surveys of the literature on bandit models. The monographs by Presman and Sonin, Berry and Fristedt [7], and
Gittins, Glazebrook, and Weber contain more detailed presentations.
Classical bandits with reward processes driven by Brownian motion or a Poisson process were
first solved by Karatzas [29] and Presman [63]. Subsequently, Bolton and Harris [8] and Keller
et al. [33, 31, 30] derived explicit formulas for optimal strategies in the case where the unobservable
quantity is a Bernoulli variable and treated strategic interactions of multiple agents. Cohen and
Solan [11] unified the formulas obtained for the single agent case and solved a bandit model where
the reward is driven by a Levy process with unknown Levy triplet.

Many extensions and variations of classical bandit problems have been proposed, including:
bandits with a varying finite or infinite number of arms (Whittle [78] and Banks and Sundaram),
bandits where an adversary has control over the payoffs (Auer et al.), bandits with dependent
arms (Pandey, Chakrabarti, and Agarwal [57]), bandits where multiple arms can be chosen
at the same time (Whittle [79]), bandits whose arms yield rewards even when they are inactive
(Glazebrook, Kirkbride, and Ruiz-Hernandez [23]), and bandits with switching costs (Banks and
Sundaram).

One of the most mathematically challenging extensions is to allow inactive arms to evolve. Such
bandits are often referred to as “restless bandits.” This term was coined in the seminal paper
of Whittle [79]. Beyond mathematical intrigue, there are many practical applications: aircraft
surveillance, sensor scheduling, queue management, clinical trials, assignment of workers to tasks,
robotics, and target tracking (Ny, Dahleh, and Feron, Veatch and Wein [73], Whittle [79], Faihe
and Muller, and La Scala and Moran). In aircraft surveillance, Ny, Dahleh, and Feron
discuss the problem of surveying ships for possible bilge water dumping. A group of unmanned 
aerial vehicles can be sent to the sites of the ships. The rewards are associated with the detection 
of a dumping event. The problem falls into the class of sensor management problems, where a set 
of sensors has to be assigned to a larger set of channels whose state evolves stochastically. In linear 
Gaussian settings these problems can be reduced to deterministic problems and turn out to be 
indexable (Ny, Feron, and Dahleh [55]). In queue management, Veatch and Wein [73] consider the
task of scheduling a make-to-stock production facility with multiple products. Finished products 
are stored in an inventory. Too small an inventory risks incurring backorder or lost sales costs, 
while too large an inventory increases holding costs. In robotics, Faihe and Muller consider
the behaviors coordination problem in a setting of reinforcement learning: a robot is trained to 
perform complex actions that are synthesized from elementary ones by giving it feedback about its 
success. 


2.2 Optimal control with partial observations. 

In control problems with partial observations, strategies are not allowed to depend on the hidden 
state. To enforce this constraint, one requires them to be measurable with respect to the
sigma-algebra generated by the observations. In continuous time, this measurability condition is not
strong enough to exclude pathological cases like Example 1 in this paper.

This problem was solved in a setting with additive, diffusive noise by requiring the existence of
a change of measure, called Zakai’s transform (Fleming and Pardoux [20]), which transforms the
observation process into standard Brownian motion. Unfortunately, this approach is not amenable
to bandit models, where such a change of measure does not exist because the volatility of the
observation process depends on the strategy. Another approach, which was applied successfully
to classical bandit models, is to define strategies as time changes (El Karoui and Karatzas).
Unfortunately, this technique does not work for restless bandit problems, where inactive arms are
allowed to evolve.

^Some bandits with switching costs can be modeled as restless bandits (Jun

Our approach can be seen as a generalization of Fleming and Nisio [19], Wonham [80], and
Kohlmann. In these works, the strategies are required to be Lipschitz continuous to ensure
well-posedness of the corresponding martingale problem. This excludes discontinuous strategies
like cut-off rules, which are typically encountered in bandit problems. We replace the Lipschitz
condition by the weaker and more direct requirement that the martingale problem is well-posed.
The resulting class of admissible strategies is large enough to contain optimal strategies of classical
bandit models and of the restless bandit model in Section 4. It is also small enough to exclude
degeneracies like Example 1 and to admit approximations in value by piecewise constant controls
(see Theorem 1). For piecewise constant controls the definition of admissibility is unproblematic.


2.3 Optimality of stopping rules. 

For classical bandit models with one safe and one risky arm, the optimality of stopping rules is a
well-known result (Berry and Fristedt [7] and El Karoui and Karatzas). Several approaches to
establish this result can be found in the literature. In one approach, the rewards of each arm are
fixed in advance and strategies are time changes. The reward that is obtained under a strategy is the
time change applied to the reward process. This setup, which has been proposed by Mandelbaum
[48], allows a very simple formulation of the measurability constraints on the strategies. It is,
however, not well-suited to bandits with evolving arms. In a second approach, one solves the
Hamilton-Jacobi-Bellman (HJB) equation for the value function. When this succeeds, the explicit
form of the value function can be used to establish the optimality of stopping rules (Bolton and
Harris [8], Keller, Rady, and Cripps, and Cohen and Solan [11]). In our model, however, the
dynamics of the reward distribution introduce an additional state variable, which turns the HJB
equation into a non-local partial differential equation that we cannot solve directly. Moreover, the
value function might not be a solution in a classical sense. Pham [60, 59] showed that under suitable
assumptions, the value function is a viscosity solution of the HJB equation. It remains open how
this could be used to show that stopping rules are optimal. The third approach is to rewrite the
problem as a linear programming problem. This makes both classical and restless bandit problems
amenable to efficient numerical computations and can also yield some qualitative insight
(Nino-Mora). The fourth approach (and the one we emulate) is based on a direct investigation of
the sample paths of optimal strategies and an evaluation of the benefits of investing in the risky
arm sooner rather than later. While this interchange argument was originally developed by Berry
and Fristedt [7] for classical bandit models, it turns out that the monotonicity assumption on the
payoffs is what is needed to make the argument work in the more general setting of restless bandits.

^Another numerical approach is dynamic programming/value function iteration.


2.4 Indexability. 


Gittins [21] characterized optimal strategies in classical bandit models by an index that is assigned
to each arm of the bandit at each instant of time. The optimal strategy is to always choose the
arm with the highest index. The indices can be calculated for each arm separately, which reduces 
the complexity of multi-armed bandits to that of two-armed bandits with one safe and one risky 
arm. 

In general, optimal strategies in restless bandit models do not admit an index representation.
Nevertheless, a Lagrangian relaxation of the problem proposed by Whittle [79] yields index
strategies that are approximately optimal (Weber and Weiss). The corresponding “Whittle
index” (Whittle [79]) is the Lagrange multiplier in a constrained optimization problem and has
an economic interpretation as a subsidy for passivity or a fair charge for operating the arm. A
major challenge to the deployment of Whittle’s index is that it can only be defined when a certain
indexability condition is met. In this condition, each arm of the restless bandit is compared to a
hypothetical arm with known and constant reward. The indexability condition holds if the set of
states where the safe arm is optimal is increasing in the reward from the safe arm.
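
The indexability condition can be checked numerically on small examples. The sketch below is a toy illustration under assumptions that are not taken from this paper: the type of the agent is known, human capital lives on a finite grid, it moves up one step under the risky arm and down one step under the safe arm (truncated at the boundaries), and the risky reward `rewards[h]` is increasing in human capital. For each subsidy, value iteration yields the set of states where the safe arm is optimal; indexability requires these sets to be nested as the subsidy grows. All function names are hypothetical.

```python
import math


def passive_set(subsidy, rewards, beta=0.9, tol=1e-12, max_iter=10_000):
    """States where the safe arm (paying `subsidy`) is optimal.

    Toy dynamics (an assumption): the risky arm moves the state up by one,
    the safe arm moves it down by one, both truncated at the grid boundary.
    """
    n = len(rewards)

    def q_values(v):
        q_risky = [rewards[h] + beta * v[min(h + 1, n - 1)] for h in range(n)]
        q_safe = [subsidy + beta * v[max(h - 1, 0)] for h in range(n)]
        return q_risky, q_safe

    v = [0.0] * n
    for _ in range(max_iter):
        q_risky, q_safe = q_values(v)
        v_new = [max(a, b) for a, b in zip(q_risky, q_safe)]
        gap = max(abs(x - y) for x, y in zip(v_new, v))
        v = v_new
        if gap < tol:
            break
    q_risky, q_safe = q_values(v)
    return frozenset(h for h in range(n) if q_safe[h] >= q_risky[h])


def passive_sets_are_nested(rewards, subsidies, beta=0.9):
    """Check that the passive set grows (weakly) along increasing subsidies."""
    sets = [passive_set(s, rewards, beta) for s in sorted(subsidies)]
    return all(small <= large for small, large in zip(sets, sets[1:]))
```

For instance, with `rewards = [math.log1p(h) for h in range(6)]` and subsidies on a grid from -0.5 to 2.0, the passive set grows from the empty set to the whole state space. A check on a grid of subsidies can only falsify indexability; it does not prove it.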

The question of indexability of restless bandit models is subtle and not yet fully understood.
Gittins, Glazebrook, and Weber give an overview of various approaches to establish the
indexability of restless bandit models. Partial answers are known for bandits with finite or countable
state spaces. Indexability of such models can be tested numerically in a linear programming
reformulation of the Markov decision problem (Klimov). In another line of research, Nino-Mora
[51] showed that indexability holds for restless bandits satisfying a partial conservation law, which
can be verified by running an algorithm. While this can be used to test the indexability of specific
restless bandit problems, it does not provide much qualitative insight into which restless bandits
are indexable. One would like to have conditions that identify general classes of indexable restless
bandit models; this is the subject of this paper.

Some results in this direction have been obtained for various bandit models related to sensor
management and dynamic multichannel access; see the papers of Nino-Mora, Liu and Zhao [46],
and Ny, Feron, and Dahleh [55], and the survey of Washburn. Further classes of indexable problems
are the dual speed problem of Glazebrook, Nino-Mora, and Ansell, the maintenance models
of Glazebrook, Ruiz-Hernandez, and Kirkbride [25], and the spinning plates and squad models of
Glazebrook, Kirkbride, and Ruiz-Hernandez [23]. Our paper is in line with these works in that it
trades indexability for specific structural conditions.


^This is a monotonicity condition on the optimal strategy, which is not to be confused with our monotonicity
condition on the payoffs and the evolution of human capital.














3 Stochastic control with partial observations. 


Section 3.1 provides the general setup. The control problem is formulated in Sections 3.2 and 3.3.
Section 3.4 contains all assumptions and Section 3.5 the main result. Some general notation can
be found in Appendix A.


3.1 Setup. 

$\mathbb{U}$ is a finite set, $\mathbb{X} = \{0,1\}$, and $\mathbb{Y}$ is a finite dimensional vector space. Controls are $\mathbb{U}$-valued
càglàd processes $U$, the hidden state is an $\mathbb{X}$-valued random variable $X$, and the observations are
càdlàg $\mathbb{Y}$-valued processes $Y$. The rewards at time $t$ are given by $b(U_t, X, Y_t)$ for some measurable
function $b\colon \mathbb{U} \times \mathbb{X} \times \mathbb{Y} \to \mathbb{R}$. Rewards are discounted exponentially at rate $\rho > 0$ over an infinite
horizon, and the aim is to maximize expected discounted rewards.
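
As a small numerical illustration of the objective, the sketch below evaluates the discounted value of a single piecewise constant reward path. It assumes the normalization by the discount rate that appears in the value definitions of Sections 3.2 and 3.3, so that a constant reward rate of 1 has value close to 1. The function name `discounted_value` and the truncation of the path after its last interval are assumptions made for this example.

```python
import math


def discounted_value(rewards, dt, rho):
    """Normalized discounted value of a piecewise constant reward path.

    rewards[k] is the reward rate on the interval [k*dt, (k+1)*dt).
    Rewards after the last interval are treated as zero (an assumption).
    """
    total = 0.0
    for k, rate in enumerate(rewards):
        # Exact integral of rho * exp(-rho * t) over [k*dt, (k+1)*dt).
        weight = math.exp(-rho * k * dt) - math.exp(-rho * (k + 1) * dt)
        total += weight * rate
    return total
```

Because the weights are exact integrals of the discount density, no discretization error arises for piecewise constant paths; averaging this quantity over simulated paths would give a Monte Carlo estimate of the expected value.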

The evolution of $Y$ depends on a càglàd $\mathbb{U}$-valued process $U$ and on the hidden state $X$. More
specifically, the joint distribution of $X$ and $Y$ will be characterized by a controlled martingale
problem associated to a linear operator

$$\mathcal{A}\colon \mathcal{D}(\mathcal{A}) \subseteq B(\mathbb{X} \times \mathbb{Y}) \to B(\mathbb{U} \times \mathbb{X} \times \mathbb{Y}),$$

where $B$ denotes the bounded measurable functions. The posterior probability that $X = 1$ given
$\mathcal{F}^Y_t$ is denoted by $P_t$, i.e., $P$ is a $[0,1]$-valued càdlàg version of the martingale $E[X \mid \mathcal{F}^Y_t]$.
Mathematically speaking, $P$ is called the filter of $X$, and economically speaking, the agent’s belief in
$X = 1$. The joint evolution of $(P,Y)$ will be characterized by a linear operator

$$\mathcal{G}\colon \mathcal{D}(\mathcal{G}) \subseteq B([0,1] \times \mathbb{Y}) \to B(\mathbb{U} \times [0,1] \times \mathbb{Y}).$$

More specific assumptions on $\mathcal{A}$, $\mathcal{G}$, and the payoff function $b$ will be made in Section 3.4.


3.2 Control problem with partial observations. 

Our definition of controls with partial observations is non-standard and an improvement over the
previous literature, as explained in Section 2.2.

Definition 1 (Martingale problem for $(\mathcal{A}, F)$). Let $F$ be a càglàd adapted $\mathbb{U}$-valued process on
Skorokhod space $D_{\mathbb{Y}}[0,\infty)$ with its natural filtration. $(X,Y)$ is a solution of the martingale problem
for $(\mathcal{A}, F)$ if there exists a filtration $\{\mathcal{F}_t\}$ such that $X$ is an $\mathcal{F}_0$-measurable $\mathbb{X}$-valued random
variable, $Y$ is an $\{\mathcal{F}_t\}$-adapted càdlàg $\mathbb{Y}$-valued process, and for each $f \in \mathcal{D}(\mathcal{A})$,

$$f(X, Y_t) - f(X, Y_0) - \int_0^t \mathcal{A}f(F(Y)_s, X, Y_s)\,ds$$

is an $\{\mathcal{F}_t\}$-martingale. The martingale problem is called well-posed if existence and local uniqueness
hold under the conditions $X = x$ and $Y_0 = y$, for all $x \in \mathbb{X}$ and $y \in \mathbb{Y}$.

^Our proofs can be generalized to finite state spaces $\mathbb{X}$ at the cost of heavier notation and to compact control
spaces $\mathbb{U}$ at the cost of additional criteria ensuring the existence of optimal non-relaxed controls for the discretized
separated problem (see e.g. the discussion after Theorem 1.21 in Seierstad).


Definition 2 (Control with partial observations). A tuple $(U,X,Y)$ is called a control with partial
observations if $U = F(Y)$ holds for some process $F$ as in Definition 1, the martingale problem for
$(\mathcal{A}, F)$ is well-posed, and $(X,Y)$ solves the martingale problem for $(\mathcal{A}, F)$.


Definition 3 (Value of controls with partial observations). The value of a control $(U,X,Y)$ with
partial observations is defined as

$$J^{p.o.}(U,X,Y) = E\left[\int_0^\infty \rho e^{-\rho t}\, b(U_t, X, Y_t)\,dt\right].$$

The set of controls with partial observations satisfying $E[X] = p$ and $Y_0 = y$ is denoted by $\mathcal{C}^{p.o.}_{p,y}$.
The value function for the control problem with partial observations is

$$V^{p.o.}(p,y) = \sup\left\{J^{p.o.}(U,X,Y) : (U,X,Y) \in \mathcal{C}^{p.o.}_{p,y}\right\}.$$


Remark 1 (Well-posedness condition). Every càglàd $\{\mathcal{F}^Y_t\}$-adapted process $U$ coincides up to a
null set with $F(Y)$ for some process $F$ as in Definition 1 (see Delzeith). Well-posedness of the
martingale problem for $(\mathcal{A}, F)$ is, however, a much stronger condition. From the agent’s perspective,
it requires the control to uniquely determine the outcome. From a mathematical perspective, it
excludes pathological cases like the one presented in Example 1 below. It also ensures that controls
can be approximated in value by piecewise constant controls, where such degeneracies cannot occur
(see Theorem 1).


Example 1 (Degeneracy in continuous time). Let $\mathbb{X} = \mathbb{U} = \{0,1\}$, $\mathbb{Y} = \mathbb{R}$, and
$\mathcal{A}f(u,x,y) = u(2x-1) f_y(x,y) + \tfrac{1}{2} u f_{yy}(x,y)$ for each $f \in \mathcal{D}(\mathcal{A}) = C_b^2(\mathbb{X} \times \mathbb{Y})$.
The aim is to maximize $E\left[\int_0^\infty \rho e^{-\rho t}\, b(U_t, X, Y_t)\,dt\right]$ over controls $(U,X,Y)$ of the problem
with partial observations, where $b(u,x,y) = u(2x-1)$. The following tuple $(U,X,Y)$ satisfies all conditions
of Definition 2 except for the well-posedness condition: $X$ is a Bernoulli variable, $W$ is Brownian motion independent
of $X$, $Y_t = (t + W_t)X$, $U_t = 1_{(0,\infty)}(t)\,X$, and $F(Y)_t = 1_{(0,\infty)}([Y,Y]_t)$. Nevertheless, $U$ depends on
the supposedly unobservable state $X$. Actually, $(U,X,Y)$ is optimal for the control problem with
observable $X$, and should not be admitted as a control for the problem with unobservable $X$.
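
The degeneracy can also be seen in a simple simulation. The sketch below is an illustration only: it discretizes the path $Y_t = (t + W_t)X$ on a time grid and evaluates the feedback rule that switches the control on as soon as the realized quadratic variation of the observations is positive. The grid, the seed, and the function name `feedback_control_path` are assumptions made for this example. After the first step the feedback control equals the hidden state, although it is formally a function of the observations alone.

```python
import math
import random


def feedback_control_path(x, n_steps=200, dt=0.01, seed=0):
    """Feedback control along a discretized path of Y = (t + W) * x.

    x is the hidden state (0 or 1). The control at each grid point is 1 if
    the quadratic variation accumulated strictly before that point is
    positive, and 0 otherwise (a left-continuous rule).
    """
    rng = random.Random(seed)
    quadratic_variation = 0.0
    controls = []
    for _ in range(n_steps):
        controls.append(1 if quadratic_variation > 0.0 else 0)
        # Increment of Y over one step: drift dt plus Brownian noise, times x.
        increment = x * (dt + math.sqrt(dt) * rng.gauss(0.0, 1.0))
        quadratic_variation += increment ** 2
    return controls
```

For `x = 1` the control is 0 at the initial grid point and 1 afterwards; for `x = 0` the observations never move and the control stays at 0. In continuous time the quadratic variation is positive immediately, which is why the rule reveals the hidden state at once.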


Remark 2 (Topology on the set of controls). So far, there is no topology on the set of controls with
partial observations. To get existence of optimal controls, one typically relaxes the control problem
by allowing measure-valued controls and shows that the resulting set of admissible controls is
compact under some weak topology [39, 17]. In control problems with partial observations involving
strong admissibility conditions as in Definition 2, the difficulty is that the set of admissible controls
is not weakly closed. This difficulty can be avoided by transforming the problem into a standard
problem with full observations, i.e., the separated problem.

^For reference, existence and local uniqueness of the above martingale problem are defined in Appendix B.
3.3 Separated control problem.


The following definition is fully standard [39, 41, 56].


Definition 4 (Separated controls). A tuple $(U,P,Y)$ is called a separated control if there exists
a filtration $\{\mathcal{F}_t\}$ such that $U$ is an adapted, càglàd $\mathbb{U}$-valued process, $(P,Y)$ is an adapted, càdlàg
$[0,1] \times \mathbb{Y}$-valued process, and for each $f \in \mathcal{D}(\mathcal{G})$, the following process is an $\{\mathcal{F}_t\}$-martingale:

$$f(P_t, Y_t) - f(P_0, Y_0) - \int_0^t \mathcal{G}f(U_s, P_s, Y_s)\,ds.$$

Definition 5 (Value of separated controls). The value of a separated control $(U,P,Y)$ is

\[ J^{se.}(U,P,Y) = E\Big[\int_0^\infty \rho e^{-\rho t}\, b(U_t,P_t,Y_t)\,dt\Big], \]

where $b(u,p,y) = p\,b(u,1,y) + (1-p)\,b(u,0,y)$. The set of controls and the value function $V^{se.}(p,y)$ are defined similarly as in Definition 2.


Remark 3 (Filtered martingale problem). Following Stockbridge [70], one could try the alternative approach of defining separated controls as solutions of the filtered martingale problem for $\mathcal{A}$, i.e., the process

\[ \Pi_t(dx) = P_t\,\delta_1(dx) + (1 - P_t)\,\delta_0(dx) \]

is $\{\mathcal{F}_t^Y\}$-adapted and for each $f \in \mathcal{D}(\mathcal{A})$, the process

\[ \int_{\mathbb{X}} f(x,Y_t)\,\Pi_t(dx) - \int_{\mathbb{X}} f(x,Y_0)\,\Pi_0(dx) - \int_0^t \int_{\mathbb{X}} \mathcal{A}f(U_s,x,Y_s)\,\Pi_s(dx)\,ds \]

is a martingale with respect to some filtration containing $\{\mathcal{F}_t^Y\}$. Unfortunately, this definition does not rule out the pathological control presented in Example 1, and cannot be used for this reason.


Remark 4 (Topology on the set of controls). The set of separated controls can be topologized by regarding them as probability measures on the canonical space $L_{\mathbb{U}}[0,\infty) \times D_{[0,1]\times\mathbb{Y}}[0,\infty)$, subject to the condition that the coordinate process solves the martingale problem in Definition 4. Compactness and existence of optimal controls can be obtained by relaxing the control problem. This amounts to replacing $L_{\mathbb{U}}[0,\infty)$ by the space of measures on $\mathbb{U} \times [0,\infty)$ with $[0,\infty)$-marginal equal to the Lebesgue measure and endowing this space with the vague topology [26, 17]. It should be noted, however, that relaxed separated controls are not filters of relaxed controls with partial observations (see Appendix C). In other words, filtering is a non-linear operation on control problems, which does not commute with relaxation.

3.4 Specification of the generators and assumptions. 

We specify the operators $\mathcal{A}$ and $\mathcal{G}$ in a list of assumptions (Assumptions 1-9). Assumptions 1-7 are unproblematic because they are direct conditions on the model coefficients and can be satisfied by definition. Assumptions 8 and 9 are more difficult to verify. They require well-posedness of certain martingale problems related to $\mathcal{A}$ and $\mathcal{G}$. This can be checked using standard results [35, 71] or, in the presence of additional structure, using more specialized arguments as discussed in Section 4.2.

The structure of the operator $\mathcal{A}$ in the following assumption allows $Y$ to be a general Markovian semimartingale, whereas $X$ is constant. To describe the behavior of small jumps, we fix a truncation function $\chi\colon \mathbb{Y} \to \mathbb{Y}$ which is bounded, continuous, and coincides with the identity on a neighborhood of zero.

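Assumption 1 (Operator $\mathcal{A}$). $\mathcal{D}(\mathcal{A}) = C_b^2(\mathbb{X} \times \mathbb{Y})$ and

\[ \mathcal{A}f(u,x,y) = \partial_y f(x,y)\,\beta(u,x,y) + \tfrac12\,\partial_y^2 f(x,y)\,\sigma^2(u,y) + \int_{\mathbb{Y}} \big( f(x,y+z) - f(x,y) - \partial_y f(x,y)\,\chi(z) \big)\,K(u,x,y,dz), \]

where $\beta\colon \mathbb{U}\times\mathbb{X}\times\mathbb{Y} \to \mathbb{Y}$, $\sigma^2\colon \mathbb{U}\times\mathbb{Y} \to \mathbb{Y}\otimes\mathbb{Y}$, and $K$ is a transition kernel from $\mathbb{U}\times\mathbb{X}\times\mathbb{Y}$ to $\mathbb{Y}\setminus\{0\}$.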

The following bounds guarantee in a simple way that the value functions are finite and reduce 
technicalities in the proofs by avoiding additional localizations by stopping times. 

Assumption 2 (Boundedness). The expressions

\[ b(u,x,y), \qquad \beta(u,x,y), \qquad \sigma^2(u,y), \qquad \int_{\mathbb{Y}} (|z|^2 \wedge 1)\,K(u,x,y,dz) \]

are measurable and bounded over $(u,x,y) \in \mathbb{U}\times\mathbb{X}\times\mathbb{Y}$.


The following assumption is related to Girsanov's theorem [27, Theorem III.3.24] applied to the conditional laws of $Y$ given $X$. It is needed to describe the filter as a change of measure between the conditional laws.

Assumption 3 (Girsanov). There exist functions $\phi_1\colon \mathbb{U}\times\mathbb{Y} \to \mathbb{Y}$ and $\phi_2\colon \mathbb{U}\times\mathbb{Y}\times\mathbb{Y} \to [0,2]$ satisfying

\[ \sigma^2(u,y)\,\phi_1(u,y) = \beta(u,1,y) - \beta(u,0,y) - \int_{\mathbb{Y}} \big(\phi_2(u,y,z) - 1\big)\,\chi(z)\,\big(K(u,1,y,dz) + K(u,0,y,dz)\big), \]

\[ \phi_2(u,y,z) = \frac{K(u,1,y,dz)}{\big(K(u,1,y,dz) + K(u,0,y,dz)\big)/2}. \]

The following assumption on the structure of the operator $\mathcal{G}$ encodes the filtering equations, which are derived in Lemma 1.

Assumption 4 (Operator $\mathcal{G}$). $\mathcal{D}(\mathcal{G}) = C_b^2([0,1] \times \mathbb{Y})$ and

\begin{align*}
\mathcal{G}f(u,p,y) ={}& \partial_y f(p,y)\,\beta(u,p,y) + \tfrac12\,\partial_p^2 f(p,y)\,p^2(1-p)^2\,\phi_1(u,y)^\top \sigma^2(u,y)\,\phi_1(u,y) \\
&+ \partial_p\partial_y f(p,y)\,p(1-p)\,\sigma^2(u,y)\,\phi_1(u,y) + \tfrac12\,\partial_y^2 f(p,y)\,\sigma^2(u,y) \\
&+ \int_{\mathbb{Y}} \big( f(p + j(u,p,y,z),\,y+z) - f(p,y) - \partial_p f(p,y)\,j(u,p,y,z) - \partial_y f(p,y)\,\chi(z) \big)\,K(u,p,y,dz),
\end{align*}

where

\[ \beta(u,p,y) = p\,\beta(u,1,y) + (1-p)\,\beta(u,0,y), \]

\[ K(u,p,y,dz) = p\,K(u,1,y,dz) + (1-p)\,K(u,0,y,dz), \]

\[ j(u,p,y,z) = \frac{p\,\phi_2(u,y,z)}{p\,\phi_2(u,y,z) + (1-p)\big(2 - \phi_2(u,y,z)\big)} - p, \]

and where it is understood that $j(u,p,y,z) = 0$ if $p \in \{0,1\}$.
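
The jump term $j$ in Assumption 4 is a Bayesian update of the belief at an observed jump. The following is a minimal sketch of that formula; the helper `poisson_phi2` is a hypothetical illustration for the case where both kernels are point masses at the same jump size with intensities `lam1` and `lam0`, which is an assumption made here.

```python
def posterior_jump(p, phi2):
    """Jump j of the belief p when a jump with density ratio phi2 is observed.

    phi2 is the density of K(u,1,y,.) with respect to the average kernel
    (K(u,1,y,.) + K(u,0,y,.)) / 2 and therefore lies in [0, 2].
    """
    if p in (0.0, 1.0):
        return 0.0  # convention of Assumption 4 for doctrinaire beliefs
    denom = p * phi2 + (1.0 - p) * (2.0 - phi2)
    return p * phi2 / denom - p

def poisson_phi2(lam1, lam0):
    """Hypothetical helper: density ratio when both kernels are point masses
    at the same jump size with intensities lam1 (X = 1) and lam0 (X = 0)."""
    return 2.0 * lam1 / (lam1 + lam0)
```

For example, with `p = 0.5`, `lam1 = 2` and `lam0 = 1`, the belief after a jump is `p * lam1 / (p * lam1 + (1 - p) * lam0) = 2/3`, so `posterior_jump(0.5, poisson_phi2(2, 1))` equals `1/6`. An uninformative jump (`phi2 = 1`) leaves the belief unchanged.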


The following assumption is a Novikov condition for the uniform integrability of a stochastic exponential. It is needed in Lemma 1 to derive the filtering equations. The condition also has an information-theoretic interpretation, see Remark 7. The specific version of the condition is due to Lépingle and Mémin [4J, Théorème IV.3].

Assumption 5 (Novikov condition). The following expression is bounded in $(y,u) \in \mathbb{Y} \times \mathbb{U}$:


^{u,y) = ^f>i{u,yy o-‘^{u,y)(i)i{u,y) 

+ (l)2{u,y,z){2 - (j) 2 {u,y,z))^ {K{u, 1, y, dz) + K{u, 0, y, dz)). 

The following two assumptions are used to show that solutions of martingale problems related 
to A and G depend continuously on parameters (c.f. ILemma 

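Assumption 6 (Continuity). The expressions

\[ \beta(u,x,y), \qquad \sigma^2(u,y), \qquad \phi_1(u,y), \qquad \int_{\mathbb{Y}} g\big(j(u,p,y,z),z\big)\,K(u,p,y,dz) \]

are continuous in $(y,u)$ for all $x \in \mathbb{X}$, $p \in [0,1]$, and $g \in C_b([0,1] \times \mathbb{Y})$ satisfying $g(x) = O(|x|^2)$ as $|x| \to 0$.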

Assumption 7 (Condition on big jumps).

\[ \lim_{a\to\infty}\, \sup\big\{ K(u,x,y,\{z \in \mathbb{Y} : |z| > a\}) : (u,x,y) \in \mathbb{U}\times\mathbb{X}\times\mathbb{Y} \big\} = 0. \]

The following two assumptions are used in various places to show that solutions of martingale problems related to the operators $\mathcal{A}$ and $\mathcal{G}$ exist and depend continuously on parameters. In contrast to the previous assumptions, these are indirect conditions on the coefficients of the model. Some examples of how they can be verified are presented in Section 4.2. General sufficient conditions are given in [35, 71].

Assumption 8 (Well-posedness for the problem with partial observations). The martingale problem for $(\mathcal{A}, F)$ is well-posed for all deterministic functions $F\colon [0,\infty) \to \mathbb{U}$.

Assumption 9 (Well-posedness for the separated problem). The martingale problem for $(\mathcal{G},u)$ is well-posed for all $u \in \mathbb{U}$.

3.5 Separation and approximation result 

Theorem 1 (Separation and approximation). The following statements hold under Assumptions 1-9:

(a) The value functions of the control problems agree:

\[ V(p,y) := V^{p.o.}(p,y) = V^{se.}(p,y) < \infty. \]

(b) Controls can be approximated arbitrarily well in value by piecewise constant controls:

\[ V(p,y) = \sup_{\delta>0} V^{\delta}(p,y), \]

where $V^{\delta}(p,y) = V^{p.o.,\delta}(p,y) = V^{se.,\delta}(p,y)$ is the value function obtained by restricting to control processes $U$ which are piecewise constant on a uniform time grid of step size $\delta > 0$.

Remark 5. • The importance of Theorem 1 lies in its capacity to transform the control problem with partial observations into a problem which can be analyzed and solved by standard methods like dynamic programming or linear programming (see Section 2.2 for some background and further references). The approximation result guarantees that the class of admissible strategies is small enough to exclude degeneracies like Example 1. It is also large enough to guarantee the existence of optimal strategies in the restless bandit problem presented in Section 4. In the general case, existence of optimal strategies can be guaranteed by the standard technique of allowing relaxed (measure-valued) controls, as described in Remark 4.

• The intuition behind Theorem 1 is that rational Bayesian agents base their strategy on the posterior distribution $P_t$ of the unknown state and the public information $Y_t$.


Theorem 1 follows from a sequence of lemmas, which can be found in Appendix D. We now give a verbal proof of the theorem, highlighting the role that each individual lemma plays.


^The martingale problem for $(\mathcal{G}, F)$ is defined in analogy to the one for $(\mathcal{A}, F)$, see Appendix B.

Proof of Theorem 1. By Assumption 2, the reward function $b$ is bounded, which implies that all value functions are finite. If $(U,X,Y)$ is a control with partial observations and $P$ is a càdlàg version of the martingale $E[X \mid \mathcal{F}_t^Y]$, then $(U,P,Y)$ is a separated control with the same value by Lemma 1. Taking the supremum over all controls or step controls, one obtains that

\[ V^{p.o.}(p,y) \le V^{se.}(p,y), \qquad V^{p.o.,\delta}(p,y) \le V^{se.,\delta}(p,y). \]

By Lemma 2, separated controls can be approximated arbitrarily well in value by separated step controls. Formally, this is expressed by the equation

\[ \sup_{\delta>0} V^{se.,\delta}(p,y) = V^{se.}(p,y). \]

In Lemma 3 it is shown that Markovian step controls of the separated problem can be transformed into controls of the problem with partial observations of the same value. This is done by a recursive construction, stitching together solutions $(X,Y)$ of the martingale problem associated to $\mathcal{A}$ under constant controls corresponding to each step of the control process. As optimal Markovian controls exist for the discretized separated problem,

\[ V^{se.,\delta}(p,y) \le V^{p.o.,\delta}(p,y). \]

Taken together, this implies that

\[ V^{p.o.,\delta}(p,y) = V^{se.,\delta}(p,y) \]

and

\[ V^{p.o.}(p,y) \le V^{se.}(p,y) = \sup_{\delta} V^{se.,\delta}(p,y) = \sup_{\delta} V^{p.o.,\delta}(p,y) \le V^{p.o.}(p,y). \]

□


4 A restless bandit model. 


We introduce and solve a specific restless bandit model motivated by dynamic complementarities in the production of human capital. The bandit model has a “safe” arm with constant payoffs corresponding to the absence of investment in human capital. The second arm is “risky” and corresponds to investment in human capital. The risky arm has stochastic payoffs, which depend on an unobserved “type” $X$ of the agent and her level of “human capital” $H$. The key assumption of the model is that investments increase the level of human capital, which in turn renders future investments more profitable (Assumption 12). This complementarity is well documented in the literature on human capital formation (see e.g. Cunha and Heckman [1^ and references therein). Mathematically speaking, it represents the only departure from the class of Lévy bandits studied by Cohen and Solan [11].


^®We point out that the use of our model is not restricted to human capital production. Complementarities between past and future investments arise in many other applications such as job search (see Section 1).

The restless bandit model is formulated in Section 4.1. Some examples are given in Section 4.2, and the model is solved in Section 4.3. The asymptotics of the filter and strategy turn out to be similar to the classical case (see Sections 4.4-4.6), but an important and potentially empirically relevant difference emerges in the analysis of populations of agents in Section 4.7: in the long run, all agents move away from the decision frontier. This makes untargeted incentives for investment ineffective and is one of the main motivations for the model at hand.


4.1 Setup and assumptions. 

The general framework of Section 3, including Assumptions 1-9, remains in place. The following structural assumption encodes that (a) the observation process $Y = (H,R)$ takes values in $\mathbb{H} \times \mathbb{R} = \mathbb{R}^2$, (b) the process $H$ has deterministic increments depending only on $U$ and $H$, (c) under a choice $U = 0$ of the safe arm, the reward process $R$ has constant increments, and (d) under a choice $U = 1$ of the risky arm, the reward process $R$ has stochastic increments depending on $X$ and $H$.

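Assumption 10 (Structural assumption). $\mathbb{U} = \{0,1\}$, $\mathbb{Y} = \mathbb{H} \times \mathbb{R} = \mathbb{R}^2$, $Y = (H,R)$. The coefficients $(\beta, \sigma, K)$ of the generator $\mathcal{A}$ in Assumption 1 are of the form

\[ \beta(1,x,h,r) = \begin{pmatrix} \beta_H(1,h) \\ \beta_R(x,h) \end{pmatrix}, \qquad \sigma^2(1,h,r) = \begin{pmatrix} 0 & 0 \\ 0 & \sigma_R^2(h) \end{pmatrix}, \qquad K(1,x,h,r,dh,dr) = \delta_0(dh)\,K_R(x,h,dr), \]

\[ \beta(0,x,h,r) = \begin{pmatrix} \beta_H(0,h) \\ k \end{pmatrix}, \qquad \sigma^2(0,h,r) = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}, \qquad K(0,x,h,r,dh,dr) = 0, \]

where

\[ k \in \mathbb{R}, \qquad \beta_H\colon \mathbb{U} \times \mathbb{H} \to \mathbb{H}, \qquad \beta_R\colon \mathbb{X} \times \mathbb{H} \to \mathbb{R}, \qquad \sigma_R\colon \mathbb{H} \to \mathbb{R}, \]

and $K_R$ is a transition kernel from $\mathbb{X} \times \mathbb{H}$ to $\mathbb{R} \setminus \{0\}$ satisfying $\sup_{x,h} \int_{\mathbb{R}} (|r|^2 \wedge |r|)\,K_R(x,h,dr) < \infty$.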

In line with the literature on Lévy bandits, the reward received at time $t$ is the infinitesimal increment $dR_t$. To rewrite this in terms of a reward function $b(U_t,X,H_t)$ we impose the condition

\[ E\Big[\int_0^\infty \rho e^{-\rho t}\,dR_t\Big] = E\Big[\int_0^\infty \rho e^{-\rho t}\,b(U_t,X,H_t)\,dt\Big]. \]

Lemma 4 shows that this condition leads to the following specification of the reward function $b$.
Assumption 11 (Reward function). The reward function $b\colon \mathbb{U} \times \mathbb{X} \times \mathbb{H} \to \mathbb{R}$ is given by

\[ b(u,x,h) = \begin{cases} \beta_R(x,h) + \int_{\mathbb{R}} \big(r - \chi(r)\big)\,K_R(x,h,dr), & \text{if } u = 1, \\ k, & \text{if } u = 0. \end{cases} \]

By the following assumption, investment in the risky arm makes future investments in the risky arm more profitable. This dependence is mediated by the process $H$, which increases with investment in the risky arm and decreases otherwise.

Assumption 12 (Monotonicity condition). The condition $\beta_H(0,h) \le 0 \le \beta_H(1,h)$ holds for all $h \in \mathbb{H}$. Moreover, the reward $b(1,x,h)$ of the risky arm is non-decreasing in $x \in \mathbb{X}$ and $h \in \mathbb{H}$.

4.2 Examples. 

We show how some well-known classical bandit models described in Section 2.1 can be extended to restless bandit models, which naturally fit into the framework of this paper and satisfy Assumptions mM. A common feature of our extension is the presence of an auxiliary state variable $H_t$, whose dynamics are given by the ODE

\[ dH_t = \beta_H(U_t,H_t)\,dt, \qquad H_0 = 0, \]

for some function $\beta_H\colon \mathbb{U} \times \mathbb{H} \to \mathbb{R}$ such that the ODE is well-posed under any deterministic control $U\colon [0,\infty) \to \mathbb{U}$. For example, this is the case if $H$ increases or decreases linearly depending on the strategy:

\[ H_0 = 0, \qquad \beta_H(0,h) = -1, \qquad \beta_H(1,h) = 1. \]

The purpose of the auxiliary state variable $H_t$ is to make the risky arm more or less profitable depending on the amount of past investment in the risky arm. We show in Examples 2-4 below how this can be done for Gaussian, Poisson, and Lévy bandits.


Example 2 (Gaussian bandits). In the Gaussian bandit model introduced by Karatzas [29], the reward of the risky arm is a diffusion whose drift depends on the unobservable type $X$. This model becomes restless if the drift depends additionally on the level of human capital:

\[ dR_t = U_t\,\beta_R(X,H_t)\,dt + U_t\,\sigma_R(H_t)\,dW_t + (1 - U_t)\,k\,dt. \]

Then $(X,Y) = (X,H,R)$ is a controlled Markov process, and its generator $\mathcal{A}$ has the structure described in Assumptions 1 and 10. Assumption 8 holds automatically thanks to Assumption 10 and the well-posedness of the ODE for $H_t$ under deterministic controls. If $\beta_R$ and $\sigma_R$ are bounded continuous functions, $\beta_R(0,h) < \beta_R(1,h)$, and $\sigma_R(h) > 0$, then Assumptions [THT^ are satisfied.
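
To make the dynamics of Example 2 concrete, the following sketch simulates an agent who always selects the risky arm. It rests on several assumptions made only for illustration: the drift `beta_R` and the constant `SIGMA_R` are hypothetical choices, $\beta_H(1,h) = 1$, time is discretized by an Euler scheme, and the belief is updated on the log-odds scale. For a pure diffusion, Itô's formula shows that this is equivalent to the belief dynamics encoded in Assumption 4, $dP_t = P_t(1-P_t)\,\phi_1\,(dR_t - \beta(1,P_t,H_t)\,dt)$ with $\phi_1 = (\beta_R(1,h) - \beta_R(0,h))/\sigma_R^2$.

```python
import math
import random

SIGMA_R = 1.0  # hypothetical constant volatility of the risky arm

def beta_R(x, h):
    """Hypothetical drift: bounded, non-decreasing in type x and capital h."""
    return x + 0.1 * math.tanh(h)

def simulate_always_invest(x, p0=0.5, horizon=100.0, dt=0.01, seed=0):
    """Euler scheme for (P, H) under the constant control U = 1.

    Returns the terminal belief P and the terminal human capital H.
    """
    rng = random.Random(seed)
    log_odds = math.log(p0 / (1.0 - p0))
    h = 0.0
    for _ in range(int(round(horizon / dt))):
        dw = rng.gauss(0.0, math.sqrt(dt))
        dr = beta_R(x, h) * dt + SIGMA_R * dw          # observed reward increment
        phi1 = (beta_R(1, h) - beta_R(0, h)) / SIGMA_R ** 2
        mid_drift = 0.5 * (beta_R(1, h) + beta_R(0, h))
        log_odds += phi1 * (dr - mid_drift * dt)       # log-likelihood ratio update
        h += dt                                        # beta_H(1, h) = 1
    return 1.0 / (1.0 + math.exp(-log_odds)), h

p_high, h_high = simulate_always_invest(x=1)
p_low, h_low = simulate_always_invest(x=0)
```

Under permanent investment the belief moves towards the true type while human capital grows linearly, in line with the discussion of asymptotic learning in Section 4.5.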


Example 3 (Poisson bandits). In Poisson bandits, which were introduced by Presman [63], the
reward of the risky arm is a Poisson process N whose jump intensity depends on the unobservable 

^®This follows from [2^, Theorem III.2.16], noting that R has deterministic semimartingale characteristics under any deterministic control. We refer to [13, EJ and references therein for more general conditions for the well-posedness of stochastic differential equations and martingale problems.

^ Assumption 9 is satisfied because the coefficients of $\mathcal{G}$ are bounded and Lipschitz continuous (see [13, Theorem III.2.12] or (3 rI. E3)). The verification of all other assumptions is straightforward.


type $X$. As an extension we allow the jump intensity to depend additionally on the current level of human capital $H_t$. Then the jump intensity becomes a function of $X$ and $H_t$, and we set

\[ dR_t = U_t\,dN_t + (1 - U_t)\,k\,dt, \qquad dN_t^p = \lambda(X,H_t)\,dt, \]

where $N^p$ denotes the compensator of the Poisson process $N$. Equivalently, the compensator of the jump measure of $R$ is $K_R(X,H_t,dr)\,dt$, where $K_R(x,h,dr) = \lambda(x,h)\,\delta_1(dr)$. If $\lambda$ is a continuous bounded function satisfying $0 < \lambda(0,h) < \lambda(1,h)$, then Assumptions [THE] hold by the same reasoning as above.


Example 4 (Lévy bandits). Lévy bandits, which were introduced by Cohen and Solan [11], generalize the class of Gaussian and Poisson bandits. They are characterized by the Lévy triplet of the reward process, which depends on the unobserved type $X$. In our extension to a restless bandit model it may depend additionally on the current level of human capital. The characterization of the reward process in terms of Lévy triplets is equivalent to the formulation in terms of the martingale problem for $\mathcal{A}$. A sufficient condition for Assumption 9 is that the jump measures $K_R(1,h,dr)$ and $K_R(0,h,dr)$ are equivalent for each $h \in \mathbb{H}$. Assumption 8 holds by the reasoning above, and all other assumptions are direct conditions on the model coefficients.


All three examples are genuinely restless bandit models because the profitability of the risky arm decreases while the risky arm is inactive. Optimal strategies for these models are provided by Theorem 2. Some important differences to classical bandit models are pointed out in Section 4.7.


4.3 Reduction to optimal stopping. 

Definition 6 (Gittins' index). Gittins' index $G$ is defined as

\[ G(p,h) = \inf\Big\{ s : \sup_{T} E\Big(\int_0^T \rho e^{-\rho t}\,(dR_t - s\,dt)\Big) \le 0 \Big\} = \sup_{T} \frac{E\big(\int_0^T \rho e^{-\rho t}\,dR_t\big)}{E\big(\int_0^T \rho e^{-\rho t}\,dt\big)}, \]

where $(1,P,H,R)$ is a separated control with constant control process $U \equiv 1$ and initial condition $(P_0,H_0) = (p,h)$, and where the suprema are taken over all stopping times $T$.
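
For intuition about the ratio form of the index, the following sketch evaluates it in a deterministic special case that is not treated separately in the text: the type is known, there is no noise, $\beta_H(1,h) = 1$ so that $H_t = h + t$ under permanent investment, and rewards accrue at a hypothetical rate `reward_rate(h + t)`. In this case stopping times reduce to deterministic times, and the supremum is approximated on a finite grid with a midpoint rule (both are simplifications assumed here).

```python
import numpy as np

def gittins_index_deterministic(reward_rate, h, rho=1.0, horizon=40.0, n=40_000):
    """Approximate sup_T of discounted reward over discounted time on a grid.

    reward_rate must accept a NumPy array of human capital levels.
    """
    t = np.linspace(0.0, horizon, n + 1)
    dt = t[1] - t[0]
    mid = 0.5 * (t[:-1] + t[1:])                      # midpoints of the grid cells
    weights = rho * np.exp(-rho * mid) * dt           # discount weights
    numerator = np.cumsum(weights * reward_rate(h + mid))
    denominator = np.cumsum(weights)
    return float(np.max(numerator / denominator))     # maximize over grid times T

constant_index = gittins_index_deterministic(lambda s: np.full_like(s, 2.0), h=0.0)
increasing_index = gittins_index_deterministic(np.tanh, h=0.0)
```

With a constant reward rate the index equals that constant. With a reward rate that increases in human capital, the index exceeds the current reward rate, which reflects the value of future complementarities.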

Theorem 2 (Optimal stopping). The following statements hold under Assumptions 1-12:

(a) The value function $V$ (see Theorem 1) does not depend on the initial value of the process $R$ and can be written as $V = V(p,h)$.


^®See [^, Theorem 3.17] for the definition of compensator or dual predictable projection.

^ ^Assumption 9 is satisfied because the martingale problem for $\mathcal{G}$ is piecewise deterministic with finitely many jumps at exponential stopping times.

^ Assumption 9 follows from [lol, Theorem 3.3], noting that uniqueness holds for the filtered martingale problem for $\mathcal{A}$ as shown in Step 1 of the proof of Lemma 2.

^^The index does not depend on the initial value of R (see Lemma 5). The two expressions for G in Definition 6 are shown to be equivalent in El Karoui and Karatzas [16].

(b) The strategy $U^*$ that selects the risky arm up to time $T^*$ and the safe arm thereafter is optimal, where

\[ T^* = \inf\{t \ge 0 : V(P_t,H_t) \le k\} = \inf\{t \ge 0 : G(P_t,H_t) \le k\}. \]


Remark 6. • The main value of Theorem 2 is that it reduces the restless bandit problem to an optimal stopping problem. This exhibits the structure of optimal strategies in terms of a decision frontier (see Proposition 1). Moreover, the stopping problem can be solved more easily by a variety of specialized methods (see e.g. Peskir and Shiryaev, Chapter IV).

• The intuition behind Theorem 2 is that choosing the risky arm early rather than late has two advantages: first, it reveals useful information about the hidden state $X$ early on, and second, it makes future rewards from the risky arm more profitable without depreciating rewards from the safe arm.

• The elimination of the state variable $r$ is possible because of Assumption 10, which asserts that the drift, volatility, and jump measure of the reward process only depend on $P$ and $H$.

• At the heart of Theorem 2 lies the assertion that any optimal control of the discretized problem can be transformed into a stopping rule of at least the same value (Lemma 8). The argument is based on Berry and Fristedt [7, Theorem 5.2.2], but in our setting rewards may depend on the history of experimentation with the risky arm. This dependence is subject to the monotonicity properties in Assumption 12. Our proof shows that these properties are exactly what is needed to adapt the argument of Berry and Fristedt to a restless bandit setting.

• The strategy $U^*$ is well-defined and optimal for the separated problem as well as the problem with partial observations.


Theorem 2 follows from a sequence of lemmas, which can be found in Appendix E. The following proof explains the role that each individual lemma plays.


Proof of Theorem 2. The value function does not depend on the initial value of $R$ by Lemma 5. Therefore, it can be written as $V(p,h)$. The discrete-time value function $V^{\delta}(p,h)$ is non-decreasing in $(p,h)$ and convex in $p$. This is established in Lemma 6 using the monotonicity properties in Assumption 12. The result is used in Lemma 7 to prove a sufficient condition for the optimality of the risky arm in the discretized problem: if the myopic payoff is higher for the risky than for the safe arm, then choosing the risky arm is uniquely optimal. This sufficient condition is used in Lemma 8 to prove that $V^{\delta}(p,h)$ is a supremum of values of stopping rules. The approximation result of Theorem 1 implies that $V(p,h)$ is also a supremum of values of stopping rules. The stopping time $T^* = \inf\{t \ge 0 : V(P_t,H_t) \le k\}$ is optimal by Lemma 9. The alternative characterization of $T^*$ in terms of Gittins' index is well-known, see e.g. Morimoto (Theorem 2.1) or El Karoui and Karatzas [16, Proposition 3.4]. □

An immediate consequence of Theorem 2 is a characterization of optimal strategies by a curve which is typically called the decision frontier.

Proposition 1 (Decision frontier). There is a curve in the $(p,h)$-domain such that it is optimal to invest in the risky arm if $(P_t,H_t)$ lies to the right of and above the curve. Otherwise, it is optimal to invest in the safe arm.

Proof. The value function $V(p,h)$ is non-decreasing in its arguments by Lemma 6 and bounded from below by the constant $k$. The desired curve is the boundary of the domain $\{(p,h) : V(p,h) > k\}$. The characterization of optimal strategies via the position of $(P_t,H_t)$ relative to the curve follows from Theorem 2. □


4.4 Indexability. 

Another consequence of Theorem 2 is the indexability of our restless bandit model in the sense of Whittle [79].


Definition 7 (Indexability). Consider a two-armed bandit problem with a safe and a risky arm. 
The bandit problem is called indexable if the set of states where the safe arm is optimal is increasing 
in the payoff k of the safe arm. 


Proposition 2 (Indexability). The restless bandit model of Section 4.1 is indexable.


Proof. Gittins’ index G{p, h) depends only on the payoff of the risky arm. Therefore, the set 
{{p, h): G{p, h) < k} where the safe arm is optimal has the required properties. □ 
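
The argument in the proof of Proposition 2 can be sketched on a finite set of states. The index values below are hypothetical numbers chosen for illustration; the only property used is that they do not depend on the safe payoff $k$.

```python
def safe_arm_states(index_by_state, k):
    """States (p, h) whose index is at most the safe payoff k."""
    return {state for state, g in index_by_state.items() if g <= k}

# Hypothetical index values G(p, h) on three states (p, h).
hypothetical_index = {(0.2, -1.0): 0.1, (0.5, 0.0): 0.4, (0.8, 1.0): 0.9}

low_k = safe_arm_states(hypothetical_index, 0.3)
mid_k = safe_arm_states(hypothetical_index, 0.5)
high_k = safe_arm_states(hypothetical_index, 1.0)
```

Because the index is independent of $k$, raising $k$ can only add states to the set where the safe arm is optimal, so the sets are nested: `low_k <= mid_k <= high_k`.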


4.5 Asymptotic learning. 

Definition 8 (Asymptotic learning and infinite investment). For any $\omega \in \Omega$, we say that asymptotic learning holds if $\lim_{t\to\infty} P_t(\omega) = X(\omega)$. We say that the agent invests an infinite amount of time in the risky arm if $\int_0^\infty U_t(\omega)\,dt = \infty$.

Assumption 13 (Bounds on the flow of information). The initial belief is non-doctrinaire, i.e., $P_0 \in (0,1)$. The measures $K_R(1,h,\cdot)$ and $K_R(0,h,\cdot)$ are equivalent, for all $h \in \mathbb{H}$. The function $\Phi(1,\cdot)$ defined in Assumption 5 is bounded from below by a positive constant.


Proposition 3 (Asymptotic learning). Under Assumptions 1-13 the following statements hold:

(a) Under any control, asymptotic learning occurs if and only if the agent invests an infinite amount of time in the risky arm.


(b) Under the optimal control of Theorem 2, asymptotic learning takes place if and only if $(P,H)$ remains above the decision frontier for all time.

Proof. (a) follows from Lemma 10. (b) follows from (a) and the characterization of optimal controls in Proposition 1. □

Remark 7. • The limit $\lim_{t\to\infty} P_t$ exists almost surely because $P$ is a bounded martingale. If the belief $P_0 \in \{0,1\}$ is doctrinaire, then the belief process $P$ is constant and equal to the hidden state $X$.

• Agents can learn their true type $X$ in two ways: either through a jump of the belief process to $X$, or through convergence to $X$ without a jump to the limit. The first kind of learning is excluded by the equivalence of $K_R(1,h,\cdot)$ and $K_R(0,h,\cdot)$. The second kind of learning is characterized by divergence of the Hellinger process of the measures $\mathbb{P}_1$ and $\mathbb{P}_0$. The Hellinger process is closely related to the function $\Phi$, which can be interpreted as the informativeness of the arm $u$ about the state $X$. The upper and lower bounds on $\Phi$ in Assumptions 5 and 13 establish an equivalence between divergence of the Hellinger process and divergence of the accumulated amount of investment in the risky arm (see Lemma 10).

• If the measures $K_R(1,h,\cdot)$ and $K_R(0,h,\cdot)$ are not equivalent, the belief process $P$ jumps to the true state $X$ with positive probability on any finite interval of time where the risky arm is chosen. For example, this is the case in the exponential bandits model of Keller, Rady, and Cripps [32].

• Proposition 3 can be contrasted with the strategic experimentation model of Bolton and Harris and the social learning model of Acemoglu et al. [1, Example 1.1]. In these models, asymptotic learning always takes place because agents continuously receive information about the hidden state, regardless of whether they choose to invest or not.


4.6 Comparison to the full-information case. 


By the full-information case, we mean the bandit model where the otherwise hidden state variable $X$ is fully observable. This model is equivalent to the model with partial observations and $P_0 \in \{0,1\}$. It follows from Theorem 2 and the monotonicity condition in Assumption 12 that the optimal strategy in the full-information case is constant in time and given by $\mathbb{1}_{V(X,H_0)>k}$.


Definition 9 (Asymptotic efficiency). For any $t \ge 0$ and $\omega \in \Omega$, $U_t(\omega)$ is called efficient if it coincides with $\mathbb{1}_{V(X(\omega),H_0)>k}$. Moreover, $U(\omega)$ is called asymptotically efficient if $U_t(\omega)$ is efficient for all sufficiently large times $t$.


Assumption 14 (Decision frontier stays away from $p = 0$ and $p = 1$). There is $\epsilon > 0$ such that for all $h \in \mathbb{H}$, $V(\epsilon,h) = k$ and $V(1,h) > k$.

Proposition 4 (Asymptotic efficiency). Let Assumptions 1-14 hold, let $U$ be the optimal strategy provided by Theorem 2, and assume that $(P_0,H_0)$ lies above the decision frontier. Conditional on $X = 0$, asymptotic efficiency holds almost surely. Conditional on $X = 1$, however, asymptotic efficiency may both hold and fail with positive probability.


Proof. If $X = 0$, investment in the risky arm cannot continue forever. Otherwise, $P_t$ would converge to zero by Proposition 3. As the decision frontier is strictly bounded away from the set $p = 0$, $(P_t,H_t)$ would eventually drop below the decision frontier, a contradiction. Thus, investment stops at some finite point in time. This is efficient given $X = 0$ because $V(0,H_0) = k$.

If $X = 1$, then $(P,H)$ may or may not drop below the frontier at some point in time. Both cases may happen with positive probability. In the former case, the agent stops investing, which is inefficient because $V(1,H_0) > k$. In the latter case, the agent never stops investing, which is efficient. □


Remark 8. • Asymptotic efficiency holds if there is some time $t$ after which the agent's plan for future investments is the same as if she had known $X$ from the beginning. Of course, this still leaves open the possibility that some early investment decisions were inefficient.

• The intuition behind Proposition 4 is that a sequence of bad payoffs can lead agents to refrain from experimentation with the risky arm. For agents of the type $X = 0$, this is efficient, but for agents with $X = 1$, it is not. In this regard, the restless bandit model behaves as a standard bandit model.

• It follows that in the long run, compared to a setting with full information, agents invest too little in the risky arm. This points to the importance of policies designed to increase investment in the risky arm.

• Assumption 14 limits the influence of $H$ on the rewards from the risky arm: the safe arm is optimal if $X = 0$ is known for sure, regardless of how high $H$ is, and similarly the risky arm is optimal if $X = 1$ is known for sure, regardless of how low $H$ is.

• Without Assumption 14 it is still possible to characterize asymptotic efficiency using the necessary and sufficient conditions of Lemma 10, but there are more cases to distinguish. Some of them have no counterpart in classical bandit models. For example, there can be low-type agents who invest in the risky arm at all times. This can be either efficient or inefficient, depending on whether $V(0,H_0)$ exceeds $k$. Similarly, it can be efficient or inefficient for high-type agents to stop investing, depending on $V(1,H_0)$.


4.7 Evolution of a population of agents. 

Assume that there is a population of agents with initial states $(P_0,H_0)$, which might vary from agent to agent. Moreover, assume that agents have independent types, such that learning from others is impossible. Alternatively, learning could be precluded by making actions and rewards private information. Then all agents behave as in the single player case. The distribution of agents in the $(p,h)$-domain evolves over time and converges to the distribution of $(P_\infty,H_\infty)$.

Proposition 5. Let Assumptions 1-14 hold, let $p^*(h)$ denote the decision frontier, and consider a population of agents with $(P_0,H_0)$ above the decision frontier.

(a) In a restless bandit model with $\beta_H(0,h) < 0 < \beta_H(1,h)$, $(P_\infty,H_\infty)$ satisfies

\[ P_\infty \in [0,p^*(-\infty)] \text{ and } H_\infty = -\infty, \qquad \text{or} \qquad P_\infty = 1 \text{ and } H_\infty = \infty. \]

(b) In a classical bandit model with $\beta_H(0,h) = \beta_H(1,h) = 0$ and $\Delta P > -\epsilon$ for some $\epsilon > 0$,

\[ P_\infty \in [p^*(H_0) - \epsilon,\, p^*(H_0)] \cup \{1\} \quad \text{and} \quad H_\infty = H_0. \]

In particular, agents in models without jumps either end up right at the decision frontier $(p^*(H_0),H_0)$ in finite time or converge to $(1,H_0)$.

Proof. (a) If $(P,H)$ drops below the decision frontier, $P$ is frozen and $H$ decreases to $-\infty$. Otherwise, $P$ increases to 1 and $H$ to $\infty$. (b) $H$ is constant and $P$ either converges to 1 or drops below the decision frontier and remains there forever. □


Remark 9. • Proposition 5 shows that agents in classical bandit models accumulate at or near the decision frontier, whereas they drift away from the frontier in restless bandit models. This leads to different predictions about the effectiveness of incentive schemes designed to increase investment in the risky arm.

• To wit, consider a subsidy for investment in the risky arm or, alternatively, a penalty for investment in the safe arm. These incentives lower the decision frontier in the $(p,h)$-domain. Some agents, who were previously below the frontier, will now find themselves above the frontier and will find it optimal to start investing in the risky arm again. The number of such agents can be expected to be very small in restless bandit models because agents keep drifting away from the frontier once they have stopped investing in the risky arm. Consequently, incentives have negligible effects on average investment, in particular if they are carried out late in time. In contrast, in classical bandit models even small shifts of the decision frontier have large effects on average investment because there are many agents at or near the frontier, namely all agents who ever stopped investing in the risky arm.

• Thus, our model provides an explanation for the ineffectiveness of subsidies designed to boost investment in projects with uncertain payoffs. Our explanation does not rely on switching costs.
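
The contrast described in Proposition 5 and Remark 9 can be sketched in a small simulation. Everything specific below is an assumption made for illustration: the decision frontier `frontier` is a hypothetical decreasing function rather than one computed from the model, rewards are Gaussian with drift $x$ and unit volatility, $\beta_H(1,h) = 1$ and $\beta_H(0,h) = -1$ in the restless case, and all agents are of type $X = 0$.

```python
import math
import random

def frontier(h):
    """Hypothetical decision frontier p*(h), strictly decreasing in h."""
    return 0.2 + 0.6 / (1.0 + math.exp(h))

def simulate_agent(x, rng, restless, p0=0.6, horizon=20.0, dt=0.01):
    """Follow the stopping rule: invest until the belief drops to the frontier.

    Returns (gap at the stopping time or None, gap at the horizon), where
    gap = frontier(H) - P measures the distance below the frontier.
    """
    log_odds = math.log(p0 / (1.0 - p0))
    h = 0.0
    gap_at_stop = None
    for _ in range(int(round(horizon / dt))):
        p = 1.0 / (1.0 + math.exp(-log_odds))
        if gap_at_stop is None and p <= frontier(h):
            gap_at_stop = frontier(h) - p            # investment stops for good
        if gap_at_stop is None:
            dr = x * dt + rng.gauss(0.0, math.sqrt(dt))
            log_odds += dr - 0.5 * dt                # belief update on the risky arm
            if restless:
                h += dt                              # capital accumulates
        elif restless:
            h -= dt                                  # capital depreciates, belief frozen
    p = 1.0 / (1.0 + math.exp(-log_odds))
    return gap_at_stop, frontier(h) - p

def population(restless, n_agents=20, seed=1):
    rng = random.Random(seed)
    return [simulate_agent(0, rng, restless) for _ in range(n_agents)]

restless_stopped = [g for g in population(True) if g[0] is not None]
classical_stopped = [g for g in population(False) if g[0] is not None]
```

In the classical case, agents who stop remain at the distance from the frontier at which they stopped. In the restless case, the same agents move further below the frontier as their human capital depreciates, so a small downward shift of the frontier reaches few of them.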


5 Conclusions. 

We presented an extension of classical bandit models of investment under uncertainty motivated 
by dynamic aspects of resource development. The extension is new and has economic signihcance 
in a wide range of real world settings. 

We dealt with the delicate issue of setting up the control problem with partial observations
in continuous time. As explained in Section 2.2, recent standard formulations of optimal control
under partial observation do not apply in our general setting. In addition to its importance to the
theory of optimal control, our solution is also a contribution to the bandit literature.

Our framework encompasses both the exponential bandit model of Keller, Rady, and Cripps [32],
where jumps can occur only for high type agents, and the Poisson and Lévy bandit models of
Keller and Rady [31], [30] and Cohen and Solan [11], where it is assumed that one jump measure is
absolutely continuous with respect to the other.

We solved the restless bandit model by an unconventional approach. Instead of using the HJB 
equation or a setup using time changes, we discretized the problem in time and showed that any 
optimal strategy can be modified such that the agent never invests after a period of not investing 
and such that the modified strategy is still optimal. 

Our models constitute a new class of indexable restless bandit models. While other classes
of indexable bandits are known, they either involve no learning about one's type (Glazebrook,
Kirkbride, and Ruiz-Hernandez [23]), do not allow history-dependent payoffs (Washburn), or
are restricted to very specific reward processes (e.g. finite-state Markov chains as in Niño-Mora).


A Notation. 

For any Polish space $S$, $B(S)$ will denote the space of $\mathbb{R}$-valued Borel-measurable functions on $S$,
$C(S)$ the continuous functions, $C_b(S)$ the bounded continuous functions, and $\mathcal{P}(S)$ the space of
probability measures on $S$. $D_S[0,\infty)$ denotes the space of $S$-valued càdlàg functions on $[0,\infty)$ with
the Skorokhod topology, $L_S[0,\infty)$ the càglàd functions, and $C_S[0,\infty)$ the subspace of continuous
functions. If $S$ is endowed with a differentiable structure, then $C_b^k(S)$ denotes the functions with $k$
bounded continuous derivatives.

Throughout the paper, all filtrations are assumed to be complete, and all processes are assumed
to be progressively measurable. The law of a random variable $X$ is denoted by $\mathcal{L}(X)$. The completion
of the filtration generated by a process $Y$ is denoted by $\{\mathcal{F}^Y_t\}$. If $Y$ has left-limits, they are
denoted by $Y_-$, i.e., $Y_{t-} = \lim_{s \uparrow t} Y_s$. If $Y$ is of finite variation, $\operatorname{Var}(Y)$ denotes its variation process.
$H \bullet Y$ denotes stochastic integration of a predictable process $H$ with respect to a semimartingale
$Y$ and $H * \mu$ with respect to a random measure $\mu$. $I$ denotes the identity process $I_t = t$. When $T$ is
a stopping time, we write $Y^T$ and $\mu^T$ for the stopped versions of $Y$ and $\mu$. Stochastic intervals are
denoted by double brackets, e.g., $[\![0,T]\!] \subseteq [0,\infty) \times \Omega$. $Y^c$ denotes the continuous local martingale
part of $Y$. A superscript $\top$ denotes the transpose of a matrix or vector.


B Controlled martingale problems. 

Definition 10 (Martingale problem for $(\mathcal{A}, F)$). Let $F$ be a càglàd adapted $\mathbb{U}$-valued process on
the space $D_{\mathbb{Y}}[0,\infty)$ with its canonical filtration.

(i) $(X, Y, T)$ is a solution of the stopped martingale problem for $(\mathcal{A}, F)$ if there exists a filtration
$\{\mathcal{F}_t\}$ such that $X$ is an $\mathcal{F}_0$-measurable $\mathbb{X}$-valued random variable, $Y$ is an $\{\mathcal{F}_t\}$-adapted
càdlàg $\mathbb{Y}$-valued process, $T$ is an $\{\mathcal{F}_t\}$-stopping time, and for each $f \in \mathcal{D}(\mathcal{A})$,

\[ f(X, Y_{t \wedge T}) - f(X, Y_0) - \int_0^{t \wedge T} \mathcal{A}f(F(Y)_s, X, Y_s)\,ds \tag{1} \]

is an $\{\mathcal{F}_t\}$-martingale.

(ii) If $T = \infty$ almost surely, then $(X,Y)$ is a solution of the martingale problem for $(\mathcal{A},F)$.

(iii) $(X,Y)$ is a solution of the local martingale problem for $(\mathcal{A},F)$ if there exists a filtration $\{\mathcal{F}_t\}$
and a sequence of $\{\mathcal{F}_t\}$-stopping times $\{T_n\}$ such that $T_n \to \infty$ almost surely and for each $n$,
$(X,Y,T_n)$ is a solution of the stopped martingale problem for $(\mathcal{A},F)$.

(iv) Local uniqueness holds for the martingale problem for $(\mathcal{A}, F)$ if for any solutions $(X', Y', T')$,
$(X'', Y'', T'')$ of the stopped martingale problem for $(\mathcal{A}, F)$, equality of the law of $(X', Y'_0)$ and
$(X'', Y''_0)$ implies the existence of a solution $(X, Y, S' \vee S'')$ of the stopped martingale problem
for $(\mathcal{A},F)$ such that $(X, Y_{\cdot \wedge S'}, S')$ has the same distribution as $(X', Y'_{\cdot \wedge T'}, T')$, and $(X, Y_{\cdot \wedge S''}, S'')$ has
the same distribution as $(X'', Y''_{\cdot \wedge T''}, T'')$.

(v) The martingale problem for $(\mathcal{A},F)$ is well-posed if local uniqueness holds for the martingale
problem for $(\mathcal{A},F)$ and for each $\nu \in \mathcal{P}(\mathbb{X} \times \mathbb{Y})$, there exists a solution $(X,Y)$ of the local
martingale problem for $(\mathcal{A},F)$ such that the law of $(X,Y_0)$ is $\nu$.
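A discrete-time analogue may help to fix the idea behind the martingale problem. The sketch below is an illustration only: the two-state chain, its transition matrix `Q`, and the test function `f` are hypothetical, and the one-step operator $f \mapsto Qf - f$ plays the role of the generator. It checks that the increment $f(Y_1) - f(Y_0) - (Qf - f)(Y_0)$ has conditional mean zero, which is the one-step version of the martingale property required in Equation (1).

```python
from fractions import Fraction as Fr

# Hypothetical two-state chain on {0, 1} with transition probabilities Q[y][z].
Q = {0: {0: Fr(3, 4), 1: Fr(1, 4)},
     1: {0: Fr(1, 2), 1: Fr(1, 2)}}


def generator(f):
    """Discrete generator (Af)(y) = E[f(Y_1) | Y_0 = y] - f(y)."""
    return {y: sum(q * f[z] for z, q in Q[y].items()) - f[y] for y in Q}


def compensated_increment_mean(f, y):
    """E[f(Y_1) - f(Y_0) - (Af)(Y_0) | Y_0 = y], which should vanish."""
    Af = generator(f)
    return sum(q * (f[z] - f[y] - Af[y]) for z, q in Q[y].items())


f = {0: Fr(2), 1: Fr(-5)}  # arbitrary test function
means = [compensated_increment_mean(f, y) for y in Q]
```

Exact rational arithmetic is used so that the conditional means are exactly zero rather than zero up to rounding.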


Definition 11 (Martingale problem for $(\mathcal{G},F)$). Let $F$ be a càglàd adapted $\mathbb{U}$-valued process on
$D_{[0,1]\times\mathbb{Y}}[0,\infty)$ with its canonical filtration.

(i) $(P,Y,T)$ is a solution of the stopped martingale problem for $(\mathcal{G},F)$ if there exists a filtration
$\{\mathcal{F}_t\}$ such that $(P,Y)$ is an $\{\mathcal{F}_t\}$-adapted càdlàg $[0,1] \times \mathbb{Y}$-valued process, $T$ is an $\{\mathcal{F}_t\}$-stopping
time, and for each $f \in \mathcal{D}(\mathcal{G})$,

\[ f(P_{t \wedge T}, Y_{t \wedge T}) - f(P_0, Y_0) - \int_0^{t \wedge T} \mathcal{G}f(F(P,Y)_s, P_s, Y_s)\,ds \tag{2} \]

is an $\{\mathcal{F}_t\}$-martingale.

(ii) Solutions of the (local) martingale problem, local uniqueness, and well-posedness are defined
in analogy to Definition 10.


C Noncommutativity of filtering and relaxation. 

To see the non-commutativity between filtering and relaxation, let us tentatively define relaxed
controls with partial observations as tuples $(\Lambda,X,Y)$ such that for each $f \in \mathcal{D}(\mathcal{A})$,

\[ f(X, Y_t) - f(X, Y_0) - \int_0^t \int_{\mathbb{U}} \mathcal{A}f(u, X, Y_{s-})\,\Lambda_s(du)\,ds \tag{3} \]

is a martingale, where $\Lambda$ is a $\{\mathcal{F}^Y_t\}$-predictable $\mathcal{P}(\mathbb{U})$-valued process. If a well-posedness condition
similar to the one in Definition 2 holds and $P_t = \mathbb{E}[X \mid \mathcal{F}^Y_t]$ is the filter, then it can be shown
that a jump $\Delta Y_t$ of the observable process leads to a jump $\Delta P_t = j(\Lambda_t, P_{t-}, Y_{t-}, \Delta Y_t)$ of the filter,
where

\[ j(\lambda,p,y,z) = \frac{\int_{\mathbb{U}} p\,\phi_2(u,y,z)\,\lambda(du)}{\int_{\mathbb{U}} \big(p\,\phi_2(u,y,z) + (1-p)(2 - \phi_2(u,y,z))\big)\,\lambda(du)} - p. \tag{4} \]


Thus, $\Delta P_t$ is uniquely determined by $\Delta Y_t$ and the information before $t$. In contrast, this is not
the case in the relaxation of the separated control problem, where a jump $\Delta Y_t$ can lead to different
values of $\Delta P_t$. Indeed, the jump measure of $(P,Y)$ is compensated by the predictable random
measure

iy{dp,dy)= / Pi_, T*-, dy)At(du). (5) 

Jy 

An interpretation is that the two cases differ in how uncertainty regarding $u$ is handled. In the
former case, the control $u$ in the support of $\Lambda_t$ is treated as unknown in the process of updating the
filter. Therefore, the jump height of the filter depends on $\Lambda_t$, but not on a random choice of $u$ in
the support of $\Lambda_t$. In the latter case, however, $u$ is treated as known but random. Different choices
of $u$ in the support of $\Lambda_t$ might lead to different probabilities for a jump $\Delta Y_t$, and consequently to
different jumps of the filter.
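The difference between the two orders of operation can be made concrete with a small numerical sketch. Everything in it is an assumption made for illustration: the prior `p`, the two likelihood ratios `phi_a` and `phi_b`, the equal-weight relaxed control, and the simplification that the reference jump kernel is the same under both controls. The function `jump` implements the Bayes update $p(\psi_2 - 1)$ with $\psi_2 = \phi_2 / (p\phi_2 + (1-p)(2-\phi_2))$.

```python
from fractions import Fraction as Fr


def psi2(p, phi2):
    """Likelihood ratio of the high-type jump kernel relative to the p-mixture."""
    return phi2 / (p * phi2 + (1 - p) * (2 - phi2))


def jump(p, phi2):
    """Jump p * (psi2 - 1) of the filter when a jump of Y is observed."""
    return p * (psi2(p, phi2) - 1)


p = Fr(1, 2)                       # hypothetical belief just before the jump
phi_a, phi_b = Fr(3, 2), Fr(1, 2)  # hypothetical values of phi_2 under controls a and b

# Relaxing first, then filtering: the control is unknown, so phi_2 is averaged
# over the relaxed control (weights 1/2, 1/2) before the Bayes update.
averaged = jump(p, (phi_a + phi_b) / 2)

# Filtering first, then relaxing: the control is drawn at random but known,
# so the update uses the realized control and two jump heights are possible.
separate = {jump(p, phi_a), jump(p, phi_b)}
```

With these numbers the averaged update leaves the belief unchanged, while the separated relaxation moves it up or down by a quarter, so the two procedures do not produce the same filter dynamics.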


D Proofs of Section 3.


Lemmas 1–3 below are used to establish Theorem 1. Assumptions 1–9 are in place.

Lemma 1 (Filtering). If $(U,X,Y)$ is a control with partial observations and $P$ is a càdlàg version
of the martingale $\mathbb{E}[X \mid \mathcal{F}^Y_t]$, then $(U,P,Y)$ is a separated control of the same value as $(U,X,Y)$.

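Proof. Step 1 (Filter as change of measure from $\mathbb{P}$ to $\mathbb{P}_1$). If $P_0 \in \{0,1\}$, then $P_t = P_0$ is constant
and equal to $X$. In this case it is trivial to check that $(U,P,Y)$ is a separated control of the
same value as $(U,X,Y)$. In the sequel, we assume that $0 < P_0 < 1$. Then the measure $\mathbb{P}$ can be
conditioned on the event $X = x$, for all $x \in \mathbb{X}$. This yields measures $\mathbb{P}_x$ such that

\[ \mathbb{P}_1(X = 1) = 1, \quad \mathbb{P}_0(X = 0) = 1, \quad \mathbb{P} = P_0\,\mathbb{P}_1 + (1-P_0)\,\mathbb{P}_0. \tag{6} \]

The process $P/P_0$ is the $\{\mathcal{F}^Y_t\}$-density process of $\mathbb{P}_1$ relative to $\mathbb{P}$ because for all $A \in \mathcal{F}^Y_t$,

\[ \int_A P_t\,d\mathbb{P} = \int_A \mathbb{E}[X \mid \mathcal{F}^Y_t]\,d\mathbb{P} = \int_A X\,d\mathbb{P} = P_0\,\mathbb{P}_1(A). \tag{7} \]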
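The density relation in Equation (7) can be checked by hand in a finite toy model. The following sketch is only an illustration under stated assumptions: observations are conditionally i.i.d. coin flips, and the prior `p0` and the success probabilities `q` are hypothetical. The filter is computed by recursive Bayes updates, and the identity $P_n \cdot \mathbb{P}(A) = P_0\,\mathbb{P}_1(A)$ is verified on every atom $A$ of the observation sigma-field.

```python
from fractions import Fraction as Fr
from itertools import product

p0 = Fr(1, 3)                   # hypothetical prior P(X = 1)
q = {1: Fr(3, 4), 0: Fr(1, 4)}  # hypothetical P(Y_k = 1 | X = x)


def path_prob(path, x):
    """Probability of an observation path under the conditional measure P_x."""
    prob = Fr(1)
    for y in path:
        prob *= q[x] if y == 1 else 1 - q[x]
    return prob


def filter_value(path):
    """Posterior P(X = 1 | path), computed by recursive Bayes updates."""
    p = p0
    for y in path:
        like1 = q[1] if y == 1 else 1 - q[1]
        like0 = q[0] if y == 1 else 1 - q[0]
        p = p * like1 / (p * like1 + (1 - p) * like0)
    return p


def density_identity_holds(n):
    """Check P_n * P(path) == p0 * P_1(path) for every path of length n."""
    for path in product([0, 1], repeat=n):
        total = p0 * path_prob(path, 1) + (1 - p0) * path_prob(path, 0)
        if filter_value(path) * total != p0 * path_prob(path, 1):
            return False
    return True


identity_ok = density_identity_holds(4)
```

Since every event in the observation sigma-field is a union of such paths, the identity on atoms extends to all events by additivity.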

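Step 2 (Stochastic exponential relating the martingale problems under $\mathbb{P}$ and $\mathbb{P}_1$). For each
$f \in \mathcal{D}(\mathcal{A})$, let $\bar{\mathcal{A}}f$ be the average of $\mathcal{A}f$ over $x \in \mathbb{X}$ with weights $p$ and $(1-p)$,

\[ \bar{\mathcal{A}}f(u,p,y) = p\,\mathcal{A}f(u,1,y) + (1-p)\,\mathcal{A}f(u,0,y). \tag{8} \]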

Let $f \in \mathcal{D}(\mathcal{A})$ and set $g(x,y) = f(1,y)$. Then $g \in \mathcal{D}(\mathcal{A})$ and $g$ is constant in $x \in \mathbb{X}$. By Definition 2,
the process

\[ g(1,Y) - g(1,Y_0) - \mathcal{A}g(U,X,Y) \bullet I \tag{9} \]

is a martingale under $\mathbb{P}$. Taking $\{\mathcal{F}^Y_t\}$-optional projections, one obtains that the process

\[ M = g(1,Y) - g(1,Y_0) - \bar{\mathcal{A}}g(U,P,Y) \bullet I, \tag{10} \]

with $\bar{\mathcal{A}}$ the averaged operator from Equation (8), is an $\{\mathcal{F}^Y_t\}$-martingale under $\mathbb{P}$. Moreover, as
$X = 1$ holds $\mathbb{P}_1$-a.s., the process

\[ \widetilde{M} = g(1,Y) - g(1,Y_0) - \mathcal{A}g(U,1,Y) \bullet I \tag{11} \]

is an $\{\mathcal{F}^Y_t\}$-martingale under $\mathbb{P}_1$. The difference between these two processes is given by

\[ M - \widetilde{M} = \partial_y g(1,Y)\big(\beta(U,1,Y) - \beta(U,P,Y)\big) \bullet I
 + \int_{\mathbb{Y}} \big(g(1,Y+z) - g(1,Y) - \partial_y g(1,Y)\chi(z)\big)\big(K(U,1,Y,dz) - K(U,P,Y,dz)\big) \bullet I. \tag{12} \]

For any $p > 0$, let $\psi_1$ and $\psi_2$ be defined by

\[ \psi_1(u,p,y) = (1-p)\,\phi_1(u,y), \qquad \psi_2(u,p,y,z) = \frac{\phi_2(u,y,z)}{p\,\phi_2(u,y,z) + (1-p)(2 - \phi_2(u,y,z))}, \tag{13} \]

where $\phi_1, \phi_2$ stem from Assumption 3. Then the following relations hold for any $p > 0$:

\[ \beta(u,1,y) - \beta(u,p,y) = \sigma^2(u,y)\,\psi_1(u,p,y) + \int_{\mathbb{Y}} \big(\psi_2(u,p,y,z) - 1\big)\chi(z)\,K(u,p,y,dz), \qquad
 K(u,1,y,dz) = \psi_2(u,p,y,z)\,K(u,p,y,dz). \tag{14} \]

For any $n \in \mathbb{N}$, let $T_n$ be the stopping time

\[ T_n = \inf\{t \ge 0 : P_t < 1/n \text{ or } P_{t-} < 1/n \text{ or } |Y_t| > n\} \wedge n. \tag{15} \]

Since $P > 0$ holds on any interval $[\![0,T_n]\!]$, Equation (14) can be used to rewrite Equation (12) as

\[ M^{T_n} - \widetilde{M}^{T_n} = \partial_y g(1,Y)\big(\sigma^2(U,Y)\,\psi_1(U,P,Y)\big) 1_{[\![0,T_n]\!]} \bullet I
 + \int_{\mathbb{Y}} \big(g(1,Y+z) - g(1,Y)\big)\big(\psi_2(U,P,Y,z) - 1\big) K(U,P,Y,dz)\, 1_{[\![0,T_n]\!]} \bullet I. \tag{16} \]

Let $\mu$ be the jump measure of $Y$, $\nu$ its $\{\mathcal{F}^Y_t\}$-compensator under $\mathbb{P}$, and

\[ L^n = \psi_1(U,P_-,Y_-)^\top 1_{[\![0,T_n]\!]} \bullet Y^c + \big(\psi_2(U,P_-,Y_-,z) - 1\big) 1_{[\![0,T_n]\!]} * (\mu - \nu)(dz,dt). \tag{17} \]

Then $L^n$ is a local $\{\mathcal{F}^Y_t\}$-martingale under $\mathbb{P}$. Keeping track of the terms in Itô's formula the same
way as in the proof of Jacod and Shiryaev [27, Theorem II.2.42] shows that

\[ M = \partial_y g(1,Y_-) \bullet Y^c + \big(g(1,Y_- + z) - g(1,Y_-)\big) * (\mu - \nu)(dz,dt) \tag{18} \]

is the decomposition of $M$ into its continuous and purely discontinuous local martingale parts. It is
now easy to calculate the predictable quadratic covariation of $M^{T_n}$ and $L^n$. Indeed, a comparison
with Equation (16) shows that

\[ \langle M^{T_n}, L^n \rangle = M^{T_n} - \widetilde{M}^{T_n}. \tag{19} \]

Equivalently, letting $D^n = \mathcal{E}(L^n)$ denote the stochastic exponential of $L^n$,

\[ \widetilde{M}^{T_n} = M^{T_n} - \frac{1}{D^n_-} \bullet \langle M^{T_n}, D^n \rangle. \tag{20} \]

Footnote (to Appendix C): the jump formula for the filter under relaxed controls follows by adapting the proof of Lemma 1 to relaxed control processes.

Step 3 (Martingale property of stochastic exponential). We will show that the local martingale
$D^n$ is a martingale by verifying the conditions of Lépingle and Mémin [Théorème IV.3]. For
any $w \in [0,1]$,

\[ \frac{1-p}{pw + (1-p)(1-w)} \le \frac{1-p}{p \wedge (1-p)} \le \frac{1}{p} \tag{21} \]

holds because the denominator on the left-hand side is a convex combination of $p$ and $(1-p)$. Replacing
$w$ by $\phi_2(u,y,z)/2$ in Equation (21) one obtains

\[ \big(\psi_2(u,p,y,z) - 1\big)^2 = \left(\frac{2(1-p)\big(\phi_2(u,y,z) - 1\big)}{p\,\phi_2(u,y,z) + (1-p)\big(2 - \phi_2(u,y,z)\big)}\right)^2 \le \frac{1}{p^2}\big(\phi_2(u,y,z) - 1\big)^2. \tag{22} \]

This inequality relates the values of $\phi_2, \psi_2$ under the transformation $w \mapsto (w-1)^2$. It can equivalently
be expressed in terms of the functions $w \mapsto w\log(w) - w + 1$ or $w \mapsto 1 - \sqrt{w(2-w)}$ because
for all $w \in [0,2]$,

\[ w\log(w) - w + 1 \le (w-1)^2 \le 4\big(w\log(w) - w + 1\big), \tag{23} \]
\[ 1 - \sqrt{w(2-w)} \le (w-1)^2 \le 2\big(1 - \sqrt{w(2-w)}\big). \tag{24} \]

Actually, the first inequality in Equation (23) holds for all $w \ge 0$, which implies that

\[ \int \big(\psi_2 \log(\psi_2) - \psi_2 + 1\big)\,K(u,p,y,dz)
 \le \int (\psi_2 - 1)^2\,K(u,p,y,dz)
 \le \frac{1}{p^2} \int (\phi_2 - 1)^2\,K(u,p,y,dz) \]
\[ \le \frac{2}{p^2} \int \Big(1 - \sqrt{\phi_2(2 - \phi_2)}\Big)\,K(u,p,y,dz)
 \le \frac{4}{p^2} \int \Big(1 - \sqrt{\phi_2(2 - \phi_2)}\Big)\,K(u,\tfrac{1}{2},y,dz), \tag{25} \]

where $\psi_2 = \psi_2(u,p,y,z)$ and $\phi_2 = \phi_2(u,y,z)$. By Assumption 5 this expression is bounded as long as
$p$ stays away from zero. Moreover, by the same assumption, the following expression is bounded:

\[ \psi_1(u,p,y)^\top \sigma^2(u,y)\,\psi_1(u,p,y) = (1-p)^2\,\phi_1(u,y)^\top \sigma^2(u,y)\,\phi_1(u,y). \tag{26} \]

Therefore,

\[ \mathbb{E}\left[\exp\left(\tfrac{1}{2}\langle L^{n,c}, L^{n,c}\rangle_\infty + \big((1+z)\log(1+z) - z\big) * \nu^{L^n}_\infty\right)\right] < \infty, \tag{27} \]

where $\nu^{L^n}$ denotes the compensator of the jump measure of $L^n$. This is the condition of Lépingle and
Mémin [Théorème IV.3] implying that $D^n = \mathcal{E}(L^n)$ is a uniformly integrable martingale. Therefore,
$D^n_\infty \mathbb{P}$ is a probability measure.

Step 4 (Identification of stochastic exponential and filter). By Definition 2, $U = F(Y)$ for a
process $F$ on $D_{\mathbb{Y}}[0,\infty)$. We will use the well-posedness of the martingale problem for $(\mathcal{A},F)$ to
show that $D^n_\infty \mathbb{P}$ agrees with $\mathbb{P}_1$ on $\mathcal{F}^Y_{T_n}$. By Girsanov's theorem and Equation (20), the stopped
process $\widetilde{M}^{T_n}$, with $\widetilde{M}$ as in Equation (11), is an $\{\mathcal{F}^Y_t\}$-martingale under $D^n_\infty \mathbb{P}$.
The process $\widetilde{M}$ can be written as

\[ \widetilde{M} = f(1,Y) - f(1,Y_0) - \mathcal{A}f(U,1,Y) \bullet I \tag{28} \]

because $\mathcal{A}$ has no derivatives or non-local terms in the $x$-direction. As $f \in \mathcal{D}(\mathcal{A})$ was chosen
arbitrarily, the tuple $(1,Y)$ under the measure $D^n_\infty \mathbb{P}$ solves the martingale problem for $(\mathcal{A},F)$ stopped
at $T_n$. The same can be said about the tuple $(X,Y)$ under the measure $\mathbb{P}_1$. Moreover, the
distribution of $(1,Y_0)$ under $D^n_\infty \mathbb{P}$ coincides with the distribution of $(X,Y_0)$ under $\mathbb{P}_1$. According to
Definition 2, local uniqueness holds for the martingale problem. It follows that $D^n_\infty \mathbb{P}$ coincides
with $\mathbb{P}_1$ on $\mathcal{F}^Y_{T_n}$. The characterization of $P/P_0$ as the density process of the measure $\mathbb{P}_1$ relative to
$\mathbb{P}$ obtained in Step 1 implies that $P = P_0 D^n$ holds on $[\![0,T_n]\!]$.

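Step 5 (Filter solves the martingale problem with generator $\mathcal{G}$). To show that $(U,P,Y)$ is a
separated control, one has to prove that for any $f \in \mathcal{D}(\mathcal{G})$,

\[ N = f(P,Y) - f(P_0,Y_0) - \mathcal{G}f(U,P,Y) \bullet I \]

is an $\{\mathcal{F}^Y_t\}$-martingale under $\mathbb{P}$. On the interval $[\![0,T_n]\!]$, $P$ agrees with $P_0 D^n$ and consequently
satisfies $P = P_0 + P_- \bullet L^n$. Therefore, the jumps of $P$ on this interval are

\[ \Delta P = P_- \Delta L^n = P_-\big(\psi_2(U,P_-,Y_-,\Delta Y) - 1\big) 1_{\Delta Y \ne 0} = j(U,P_-,Y_-,\Delta Y)\, 1_{\Delta Y \ne 0}, \tag{29} \]

where the function $j$ is defined in Assumption 4. Moreover, on the same interval $[\![0,T_n]\!]$,

\[ \langle P^c, P^c \rangle = P_-^2 \bullet \langle L^{n,c}, L^{n,c} \rangle = \sum_{i,j} P_-^2\,\psi_{1,i}(U,P_-,Y_-)\,\psi_{1,j}(U,P_-,Y_-) \bullet \langle Y^{c,i}, Y^{c,j} \rangle
 = P^2(1-P)^2\,\phi_1(U,Y)^\top \sigma^2(U,Y)\,\phi_1(U,Y) \bullet I, \]
\[ \langle P^c, Y^c \rangle = P_- \bullet \langle L^{n,c}, Y^c \rangle = P_-\,\psi_1(U,P_-,Y_-)^\top \bullet \langle Y^c, Y^c \rangle = P(1-P)\,\phi_1(U,Y)^\top \sigma^2(U,Y) \bullet I, \]
\[ \langle Y^c, Y^c \rangle = \sigma^2(U,Y) \bullet I. \tag{30} \]

It follows from Itô's formula and the definition of $\mathcal{G}$ in Assumption 4 that the stopped process
$N^{T_n}$ is an $\{\mathcal{F}^Y_t\}$-local martingale under $\mathbb{P}$. It is also bounded by Assumption 2, so it is a martingale.
Setting $g(x,y) = f(0,y)$, one has $g \in \mathcal{D}(\mathcal{A})$, and the process

\[ M = g(0,Y) - g(0,Y_0) - \bar{\mathcal{A}}g(U,P,Y) \bullet I \tag{31} \]

is a martingale. Then it holds for any bounded stopping time $S$ and each $n \in \mathbb{N}$ that

\[ \mathbb{E}[N_S] = \mathbb{E}[N_{S \wedge T_n} + N_{S \vee T_n} - N_{T_n}] = \mathbb{E}[N_{S \wedge T_n} + M_{S \vee T_n} - M_{T_n} + R_n] = \mathbb{E}[R_n], \tag{32} \]

with a remainder given by

\[ R_n = (N_{S \vee T_n} - N_{T_n}) - (M_{S \vee T_n} - M_{T_n}). \tag{33} \]

Let $\omega \in \Omega$ and

\[ T = \lim_{n \to \infty} T_n = \inf\{t \ge 0 : P_t = 0 \text{ or } P_{t-} = 0\}. \tag{34} \]

If $P_{T-}(\omega) = 0$, then $T_n(\omega) < T(\omega)$ holds for all $n \in \mathbb{N}$. Otherwise, there is $k \in \mathbb{N}$ such that
$T_n(\omega) = T(\omega)$ holds for all $n \ge k$. Therefore,

\[ \lim_{n \to \infty} R_n = \begin{cases}
 (N_{T-} - N_{T-}) - (M_{T-} - M_{T-}), & \text{if } P_{T-} = 0 \text{ and } S < T, \\
 (N_S - N_{T-}) - (M_S - M_{T-}), & \text{if } P_{T-} = 0 \text{ and } S \ge T, \\
 (N_{S \vee T} - N_T) - (M_{S \vee T} - M_T), & \text{if } P_{T-} \ne 0.
 \end{cases} \tag{35} \]

It can be seen from the definitions of $\mathcal{A}$ and $\mathcal{G}$ in Assumptions 1 and 4 that $\mathcal{G}f(u,0,y) = \mathcal{A}g(u,0,y)$.
Therefore, $N = M$ holds on the interval $[\![T,\infty[\![$, where $P = 0$. Moreover, $N_{T-} = M_{T-}$ holds if
$P_{T-} = 0$. This implies that $\lim_{n \to \infty} R_n = 0$. The processes $N^S$ and $M^S$ are bounded, which follows
from Assumption 2 and the boundedness of $S$. Therefore,

\[ R_n = 1_{\{T_n < S\}}\big((N_S - N_{S \wedge T_n}) - (M_S - M_{S \wedge T_n})\big) \tag{36} \]

is bounded by a constant not depending on $n$. By the dominated convergence theorem, $\mathbb{E}[N_S] =
\lim_{n \to \infty} \mathbb{E}[R_n] = 0$. As this holds for all bounded stopping times $S$, we conclude that $N$ is a
martingale. As $f \in \mathcal{D}(\mathcal{G})$ was chosen freely, $(U,P,Y)$ is a separated control.

Step 6 (Value of separated control). $(U,P,Y)$ has the same value as $(U,X,Y)$ because $b(U,P,Y)$
is the $\{\mathcal{F}^Y_t\}$-optional projection of $b(U,X,Y)$. □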


Lemma 2 (Approximation). Separated controls can be approximated arbitrarily well in value by
separated step controls:

\[ V^{\mathrm{se}}(p,y) = \sup_{\delta > 0} V^{\mathrm{se},\delta}(p,y). \tag{37} \]

Here $V^{\mathrm{se},\delta}$ denotes the value function obtained by admitting only processes $U$ which are piecewise
constant on an equidistant time grid of step size $\delta > 0$ in the separated control problem.

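Proof. Step 1 (Filtered martingale problem). Let $U$ be deterministic and let $(P,Y)$ be a càdlàg
process with values in $[0,1] \times \mathbb{Y}$. We identify $P$ with the $\mathcal{P}(\mathbb{X})$-valued process $\Pi$ given by

\[ \Pi_t(dx) = P_t\,\delta_1(dx) + (1 - P_t)\,\delta_0(dx), \tag{38} \]

where $\delta_x$ denotes the Dirac measure at $x \in \mathbb{X}$. In line with Kurtz and Ocone [37], we say that
$(\Pi, Y)$ is a solution of the filtered martingale problem for $(\mathcal{A}, U)$ if

\[ \int_{\mathbb{X}} f(x,Y)\,\Pi(dx) - \int_{\mathbb{X}} f(x,Y_0)\,\Pi_0(dx) - \int_{\mathbb{X}} \mathcal{A}f(U,x,Y)\,\Pi(dx) \bullet I \tag{39} \]

is a martingale, for each $f \in \mathcal{D}(\mathcal{A})$, and $\Pi$ is $\{\mathcal{F}^Y_t\}$-adapted.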

We will use Kurtz and Nappo [40, Theorem 3.6] to show that uniqueness holds for the filtered
martingale problem. Thus, we have to verify points (i)-(vi) of Condition 2.1 in that paper. These
are conditions on the operator $\mathcal{A}f(U,x,y)$ in (39), interpreted as a time-dependent generator of
$(X,Y)$. To put everything into a time-homogeneous framework, we work with the time-augmented
process $(I,X,Y)$. Its generator $\mathcal{A}^U$ is given by

\[ \mathcal{D}(\mathcal{A}^U) = C_b^2(\mathbb{R} \times \mathbb{X} \times \mathbb{Y}), \qquad \mathcal{A}^U g(t,x,y) = \partial_t g(t,x,y) + \mathcal{A}g_t(U_t,x,y), \tag{40} \]

where $g_t(x,y) = g(t,x,y)$. For point (i), there is nothing to prove. For point (ii), one has to show
that $\mathcal{A}f(u,x,y)$ is continuous in $(u,x,y)$, for each $f \in \mathcal{D}(\mathcal{A})$. To see this, let

\[ m(u,x,y) = 1 + \int_{\mathbb{R}^d} \big(|z|^2 \wedge 1\big)\,K(u,x,y,dz), \tag{41} \]
\[ \tilde{f}(u,x,y,z) = m(u,x,y)\,\frac{f(x,y+z) - f(x,y) - \partial_y f(x,y)\chi(z)}{|z|^2 \wedge 1}, \tag{42} \]
\[ \tilde{K}(u,x,y,dz) = \frac{\big(|z|^2 \wedge 1\big)\,K(u,x,y,dz)}{m(u,x,y)}. \tag{43} \]

Then everything is set up such that

\[ \int_{\mathbb{Y}} \big(f(x,y+z) - f(x,y) - \partial_y f(x,y)\chi(z)\big)\,K(u,x,y,dz) = \int_{\mathbb{Y}} \tilde{f}(u,x,y,z)\,\tilde{K}(u,x,y,dz). \tag{44} \]

Now let $(u_n,x_n,y_n)_{n \in \mathbb{N}}$ be a sequence in $\mathbb{U} \times \mathbb{X} \times \mathbb{Y}$ converging to $(u,x,y)$. By Assumption 6,
$m(u,x,y)$ is continuous and the measures $\tilde{K}(u_n,x_n,y_n,dz)$ are weakly convergent. A version of
Skorokhod's representation theorem for measures instead of probability measures (for example Startek
[69]) implies that there are mappings $(Z_n)_{n \in \mathbb{N}}$ and $Z$ with values in $\mathbb{Y}$, all defined on the same
measure space with finite measure, such that for each $n \in \mathbb{N}$, $Z_n$ has distribution $\tilde{K}(u_n,x_n,y_n,dz)$, $Z$
has distribution $\tilde{K}(u,x,y,dz)$, and $Z_n \to Z$ almost surely. By the dominated convergence theorem,

\[ \mathbb{E}\big[\tilde{f}(u_n,x_n,y_n,Z_n)\big] \to \mathbb{E}\big[\tilde{f}(u,x,y,Z)\big], \tag{45} \]

which shows that the expression in (44) is continuous in $(u,x,y)$. This settles point (ii). Point (iii) is
satisfied with $\psi = 1$ by Assumption 2. Points (iv) and (vi) are satisfied for $\mathcal{D}(\mathcal{A}^U) = C_b^2(\mathbb{R} \times \mathbb{X} \times \mathbb{Y})$.

Finally, point (v) is satisfied because of Assumption 8, which guarantees that for each constant,
deterministic control $U$ and all initial conditions, there exists a càdlàg solution of the martingale
problem for $\mathcal{A}^U$ (cf. the discussion before Theorem 2.1 in Kurtz [38]). Moreover, by Assumption 8,
uniqueness holds for the martingale problem for $\mathcal{A}^U$, which coincides with the martingale problem
for $(\mathcal{A}, U)$ from Definition 1. Thus, all conditions of Kurtz and Nappo [40, Theorem 3.6] are fulfilled
and uniqueness holds for the filtered martingale problem.

Step 2 (Projecting separated controls to solutions of the filtered martingale problem). Let
$(U,P,Y)$ be a separated control with deterministic control process $U$. Let $f \in \mathcal{D}(\mathcal{G})$ be affine
in the first variable $p$, i.e.,

\[ f(p,y) = p\,f(1,y) + (1-p)\,f(0,y). \tag{46} \]

Then $\mathcal{G}f(u,p,y)$ is also affine in $p$, i.e.,

\[ \mathcal{G}f(u,p,y) = p\,\mathcal{G}f(u,1,y) + (1-p)\,\mathcal{G}f(u,0,y) = p\,\mathcal{A}f(u,1,y) + (1-p)\,\mathcal{A}f(u,0,y). \tag{47} \]

This can be verified using the definition of $\phi_1$ and $\phi_2$, noting that all quadratic terms in $p$ cancel
out in the expression of $\mathcal{G}f(u,p,y)$. Identifying $P$ with $\Pi$ as in Step 1, one obtains that the process

\[ f(P,Y) - f(P_0,Y_0) - \mathcal{G}f(U,P,Y) \bullet I
 = \int_{\mathbb{X}} f(x,Y)\,\Pi(dx) - \int_{\mathbb{X}} f(x,Y_0)\,\Pi_0(dx) - \int_{\mathbb{X}} \mathcal{A}f(U,x,Y)\,\Pi(dx) \bullet I \tag{48} \]

is a martingale. If $\widehat{\Pi}$ denotes the $\{\mathcal{F}^Y_t\}$-optional projection of $\Pi$, then

\[ \int_{\mathbb{X}} f(x,Y)\,\widehat{\Pi}(dx) - \int_{\mathbb{X}} f(x,Y_0)\,\widehat{\Pi}_0(dx) - \int_{\mathbb{X}} \mathcal{A}f(U,x,Y)\,\widehat{\Pi}(dx) \bullet I \tag{49} \]

is also a martingale. Thus, $(\widehat{\Pi}, Y)$ is a solution of the filtered martingale problem. By the previous
step, the law of $(\widehat{\Pi}, Y)$ is uniquely determined. An important consequence is that all separated
controls sharing the same deterministic control process $U$ have the same value:

\[ \mathbb{E}\left[\int_0^\infty \rho e^{-\rho t}\,b(U_t,P_t,Y_t)\,dt\right] = \mathbb{E}\left[\int_0^\infty \rho e^{-\rho t}\,b(U_t,\widehat{P}_t,Y_t)\,dt\right] =: J(U), \tag{50} \]

where $\widehat{P}$ is the $\{\mathcal{F}^Y_t\}$-optional projection of $P$.

Step 3 (Tightness of separated controls). Let $(U^n,P^n,Y^n)$ be separated controls with deterministic
control processes $U^n$ and let $U^n \to U$ in the stable topology, i.e.,

\[ \int_0^\infty g(U^n_t,t)\,dt \to \int_0^\infty g(U_t,t)\,dt \tag{51} \]

for all bounded measurable functions $g \colon \mathbb{U} \times [0,\infty) \to \mathbb{R}$ with compact support which are continuous
in $u$. The stable topology coincides with the vague topology, checked on continuous functions with
compact support. For more details on the vague and stable topology we refer to El Karoui, Nguyen,
and Jeanblanc-Picqué [17] and Jacod and Mémin. We will use Jacod and Shiryaev [27, Theorem
IX.3.9] to show that the laws of $(P^n,Y^n)$ are tight. Thus, we have to verify the conditions of this
theorem. By the same estimates as in Step 3 of the proof of Lemma 1, one obtains that

\[ \int_{\mathbb{Y}} \big(|j(u,p,y,z)|^2 + |z|^2\big) \wedge 1\; K(u,p,y,dz)
 \le \int_{\mathbb{Y}} j(u,p,y,z)^2\,K(u,p,y,dz) + \int_{\mathbb{Y}} |z|^2 \wedge 1\; K(u,p,y,dz) \]
\[ \le \int_{\mathbb{Y}} \big(\phi_2(u,y,z) - 1\big)^2\,K(u,p,y,dz) + \int_{\mathbb{Y}} |z|^2 \wedge 1\; K(u,p,y,dz) \]
\[ \le 2\int_{\mathbb{Y}} \Big(1 - \sqrt{\phi_2(u,y,z)\big(2 - \phi_2(u,y,z)\big)}\Big)\,K(u,p,y,dz) + \int_{\mathbb{Y}} |z|^2 \wedge 1\; K(u,p,y,dz). \tag{52} \]

By Assumptions 2 and 5, the integrals on the last line above are bounded by a constant which does
not depend on $(u,p,y)$. By Assumptions 2 and 5, also the drift and the diagonal entries of the
volatility matrix

\[ \beta(u,p,y), \qquad p^2(1-p)^2\,\phi_1(u,y)^\top \sigma^2(u,y)\,\phi_1(u,y), \qquad \sigma^2(u,y) \tag{53} \]

are bounded by a constant not depending on $(u,p,y)$. It follows that Condition IX.3.6 (the strong
majoration hypothesis) is satisfied. Condition IX.3.7 (the condition on the big jumps) follows from
Assumption 7. By the stable convergence of $U^n$ to $U$, using Assumption 6 and the bounds which
were just shown, the following convergence holds for all $t \ge 0$, $(P,Y) \in D_{[0,1]\times\mathbb{Y}}[0,\infty)$, and functions
$g \in C_b([0,1] \times \mathbb{Y})$ vanishing near the origin:

\[ \beta(U^n,P,Y) \bullet I_t \to \beta(U,P,Y) \bullet I_t, \]
\[ P^2(1-P)^2\,\phi_1(U^n,Y)^\top \sigma^2(U^n,Y)\,\phi_1(U^n,Y) \bullet I_t \to P^2(1-P)^2\,\phi_1(U,Y)^\top \sigma^2(U,Y)\,\phi_1(U,Y) \bullet I_t, \]
\[ P(1-P)\,\sigma^2(U^n,Y)\,\phi_1(U^n,Y) \bullet I_t \to P(1-P)\,\sigma^2(U,Y)\,\phi_1(U,Y) \bullet I_t, \]
\[ \sigma^2(U^n,Y) \bullet I_t \to \sigma^2(U,Y) \bullet I_t, \]
\[ \int_{\mathbb{Y}} g\big(j(U^n,P,Y,z),z\big)\,K(U^n,P,Y,dz) \bullet I_t \to \int_{\mathbb{Y}} g\big(j(U,P,Y,z),z\big)\,K(U,P,Y,dz) \bullet I_t. \tag{54} \]

It follows from Lemma IX.3.4 that the conditions of Theorem IX.3.9 are satisfied. Thus, the laws
of $(P^n,Y^n)$ are tight. Moreover, any limit $(P,Y)$ of a weakly converging subsequence of $(P^n,Y^n)$
solves the martingale problem for $(\mathcal{G},U)$ and defines a separated control $(U,P,Y)$. This follows
from Jacod and Shiryaev [27, Theorem IX.2.11] by the same assumptions.

Step 4 (Step controls). For any $\delta > 0$, the mapping

\[ \Psi^\delta \colon L_{\mathbb{U}}[0,\infty) \to L_{\mathbb{U}}[0,\infty), \qquad (\Psi^\delta U)_t = \sum_{i=0}^\infty U_{i\delta}\,1_{(i\delta,(i+1)\delta]}(t) \tag{55} \]

approximates deterministic control processes by step control processes of step size $\delta$. Indeed,
$\lim_{\delta \to 0}(\Psi^\delta U)_t = U_t$ holds for each $t \ge 0$. Moreover, by dominated convergence, $\Psi^\delta U$ converges
stably to $U$. Let

\[ L^{\mathrm{step}}_{\mathbb{U}}[0,\infty) = \bigcup_{\delta > 0} \big\{\Psi^\delta U : U \in L_{\mathbb{U}}[0,\infty)\big\} \subset L_{\mathbb{U}}[0,\infty) \tag{56} \]

denote the set of all step control processes. For any step control process $U \in L^{\mathrm{step}}_{\mathbb{U}}[0,\infty)$, there is a
control with partial observations $(U,X,Y)$ by Assumption 8 and a corresponding separated control
$(U,P,Y)$ by Lemma 1. Let $\mathbb{Q}_U$ denote the law of $(P,Y)$ under $U$. If $U^n \in L^{\mathrm{step}}_{\mathbb{U}}[0,\infty)$ converges
stably to a step control $U \in L^{\mathrm{step}}_{\mathbb{U}}[0,\infty)$, then $\mathbb{Q}_{U^n}$ converges weakly to $\mathbb{Q}_U$ by the arguments in Step
2 and by Assumption 9 ensuring uniqueness of the martingale problem for $(\mathcal{G},U)$. As continuity
implies measurability, $\mathbb{Q}$ is a transition kernel from $L^{\mathrm{step}}_{\mathbb{U}}[0,\infty)$ with the Borel sigma algebra of stable
convergence to Skorokhod space $D_{[0,1]\times\mathbb{Y}}[0,\infty)$.
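The step approximation in Equation (55) is simple to write down explicitly. The sketch below is a minimal illustration under stated assumptions: the control takes values in {0, 1}, it is left-continuous with a single switch at time 1, and the names `step_approximation` and `control` are hypothetical. It evaluates the approximation at a fixed time for step sizes $2^{-k}$ and shows the pointwise convergence claimed in Step 4.

```python
import math


def step_approximation(control, delta):
    """Return t -> control(i*delta) for t in (i*delta, (i+1)*delta], i = 0, 1, ..."""
    def stepped(t):
        if t <= 0:
            return control(0.0)
        i = math.ceil(t / delta) - 1  # index of the grid cell containing t
        return control(i * delta)
    return stepped


def control(t):
    """Hypothetical left-continuous control: risky arm up to time 1, safe arm after."""
    return 1 if t <= 1.0 else 0


t = 1.2
values = [step_approximation(control, 2.0 ** -k)(t) for k in range(1, 11)]
```

For the two coarsest grids the approximation still reports the value before the switch, and from step size 1/8 onward it agrees with the control at time 1.2. Left-continuity matters here because the grid point used for the approximation always lies strictly to the left of the evaluation time.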

Step 5 (Approximation of deterministic controls). Let $U \in L_{\mathbb{U}}[0,\infty)$ and define $U^n = \Psi^{1/n}U$
for each $n \in \mathbb{N}$. By Assumption 8, there are controls with partial observations $(U^n,X^n,Y^n)$ and,
by Lemma 1, corresponding separated controls $(U^n,P^n,Y^n)$. By the tightness result of Step 3, any
subsequence along which $J(U^n)$ converges contains another subsequence, still denoted by $n$, such
that $(P^n,Y^n)$ converge weakly to some solution $(P,Y)$ of the martingale problem for $(\mathcal{G},U)$. By
Skorokhod's representation theorem we may assume after passing to yet another subsequence that
$(P^n,Y^n)$ and $(P,Y)$ are defined on the same probability space and that $(P^n,Y^n)$ converge to $(P,Y)$
almost surely. As $\Delta P_t = 0$ holds almost surely for each fixed $t \ge 0$, it follows from the dominated
convergence theorem and the pointwise convergence of $U^n_t$ to $U_t$ that

\[ \lim_{n \to \infty} J(U^n) = \int_0^\infty \rho e^{-\rho t} \lim_{n \to \infty} \mathbb{E}\big[b(U^n_t,P^n_t,Y^n_t)\big]\,dt = \int_0^\infty \rho e^{-\rho t}\,\mathbb{E}\big[b(U_t,P_t,Y_t)\big]\,dt = J(U). \tag{57} \]


Step 6 (Approximation of arbitrary controls). The law of any separated control $(U,P,Y)$ is
a probability measure $\mathbb{P}$ on the space $L_{\mathbb{U}}[0,\infty) \times D_{[0,1]\times\mathbb{Y}}[0,\infty)$. We will work on this canonical
probability space in the sequel. Using disintegration, $\mathbb{P}$ can be written in the form

\[ \mathbb{P}(dU,dP,dY) = \mathbb{P}(dU)\,\mathbb{P}_U(dP,dY). \tag{58} \]

Accordingly, the value of the control can be expressed as

\[ J^{\mathrm{se}}(U,P,Y) = \int_{L_{\mathbb{U}}[0,\infty)} \mathbb{E}_{\mathbb{P}_U}\left[\int_0^\infty \rho e^{-\rho t}\,b(U_t,P_t,Y_t)\,dt\right] \mathbb{P}(dU). \tag{59} \]

For $\mathbb{P}$-a.e. $U$, the process $(P,Y)$ under the measure $\mathbb{P}_U$ solves the martingale problem for $(\mathcal{G},U)$.
Moreover, the process $U$ is deterministic under the measure $\mathbb{P}_U$. By Step 2, all solutions of the
martingale problem $(\mathcal{G},U)$ with deterministic control process $U$ have the same value $J(U)$. This
allows one to express the value of the control as

\[ J^{\mathrm{se}}(U,P,Y) = \int_{L_{\mathbb{U}}[0,\infty)} J(U)\,\mathbb{P}(dU). \tag{60} \]

By Step 4 and dominated convergence,

\[ J^{\mathrm{se}}(U,P,Y) = \lim_{n \to \infty} \int_{L_{\mathbb{U}}[0,\infty)} J(\Psi^{1/n}U)\,\mathbb{P}(dU) = \lim_{n \to \infty} J^{\mathrm{se}}(U^n,P^n,Y^n), \tag{61} \]

where $(U^n,P^n,Y^n)$ is the coordinate process on $L_{\mathbb{U}}[0,\infty) \times D_{[0,1]\times\mathbb{Y}}[0,\infty)$ under the measure
$\mathbb{Q}_{\Psi^{1/n}U}(dP,dY)\,\mathbb{P}(dU)$. Thus, $(U^n,P^n,Y^n)$ is a sequence of separated step controls approximating
$(U,P,Y)$ in value. □

Lemma 3 (From separated to partially observed controls). For every separated step control, there
exists a step control with partial observations of at least the same value, implying
$V^{\mathrm{po},\delta}(p,y) \ge V^{\mathrm{se},\delta}(p,y)$.

Proof. Step 1 (Reduction to Markovian step controls). To distinguish the separated and the
partially observed versions of the problem, we will mark objects of the separated problem with a
tilde. By Assumption 9, the discretized separated problem is that of controlling the Markov chain
$(\tilde{P}_{t_i}, \tilde{Y}_{t_i})_{i \in \mathbb{N}}$, where $(t_i)_{i \in \mathbb{N}}$ is a uniform time grid of step size $\delta > 0$. It is well-known that optimal
Markov controls exist for such problems (see e.g. Berry and Fristedt [7] and Seierstad). We
will prove the lemma by showing that every Markov control for the discretized, separated problem
corresponds to a step control for the problem with partial observations which has the same value.
So we start with a Markovian step control $(\tilde{U}, \tilde{P}, \tilde{Y})$ with control process $\tilde{U}$ given by

\[ \tilde{U}_t = F_i(\tilde{P}_{t_i}, \tilde{Y}_{t_i}), \quad \text{if } t \in (t_i, t_{i+1}], \tag{62} \]

for some functions $F_i \colon [0,1] \times \mathbb{Y} \to \mathbb{U}$, $i \in \mathbb{N}$.

Step 2 (Construction of a candidate control with partial observations). To construct the control
for the problem with partial observations, we work on the canonical space $\Omega = \mathbb{X} \times D_{\mathbb{Y}}[0,\infty)$ with
its natural sigma algebra and filtration. The coordinates on this space are denoted by $(X,Y)$.
When $T$ is a (strict) stopping time, $\mathbb{P}$ is a probability measure on $\Omega$, and $\mathbb{Q}$ is an $\mathcal{F}_T$-measurable
random variable with values in the space of probability measures on $\Omega$, then we let $\mathbb{P} \otimes_T \mathbb{Q}$ denote
the unique probability measure on $\Omega$ such that (i) the law of the stopped process $(X, Y^T)$ is equal to
$\mathbb{P}$ on the sigma algebra $\mathcal{F}_T$ and (ii) the $\mathcal{F}_T$-conditional law of the time-shifted process $(X, Y_{T+t})_{t \ge 0}$
is $\mathbb{Q}$. This notation is explained and relevant results are proven in Stroock and Varadhan [6.1.2,
6.1.3 and 1.2.10] for continuous processes. For processes with jumps, the relevant results are Jacod
and Shiryaev [27, Lemmas III.2.43-48], but the notation $\mathbb{P} \otimes_T \mathbb{Q}$ is not used there.

By Assumption 8, we get for each $(u,x,y) \in \mathbb{U} \times \mathbb{X} \times \mathbb{Y}$ a unique probability measure $\mathbb{Q}^u(x,y)$
on $\Omega$ such that $X = x$ and $Y_0 = y$ holds almost surely and such that $(X,Y)$ solve the martingale
problem for $(\mathcal{A},u)$ under $\mathbb{Q}^u(x,y)$. By Jacod and Shiryaev [27, Theorem IX.3.39], $\mathbb{Q}^u(x,y)$ is weakly
continuous, thus measurable, in $(u,x,y)$. Verifying the conditions of the theorem can be done as in
the proof of Lemma 2, but it is easier in the present situation. We now define inductively for each
$n \in \mathbb{N}$ a probability measure $\mathbb{P}^n$ and a càdlàg process $P^n$ on $\Omega$ as follows.


po = ho) + (1 - Vo), Pt = Epo[V 

P« = P—1 VJ, P- = Epn[V 

It follows that the measures $\mathbb{P}^n$ and $\mathbb{P}^m$ agree on $\mathcal{F}_{t_n \wedge t_m}$ and that the processes $P^n$ and $P^m$ agree
almost surely on $[0, t_n \wedge t_m]$. Therefore, there is a unique measure $\mathbb{P}$ which coincides with $\mathbb{P}^n$ on
$\mathcal{F}_{t_n}$, for all $n$. Furthermore, there is a unique càdlàg process $P$ that is almost surely equal to $P^n$
on $[0,t_n]$, for all $n$. If $U$ is defined as

\[ U_t = \sum_{i=0}^\infty F_i(P_{t_i}, Y_{t_i})\,1_{(t_i,t_{i+1}]}(t), \tag{64} \]

then by construction, the process

\[ f(X,Y) - f(X,Y_0) - \mathcal{A}f(U,X,Y) \bullet I \tag{65} \]

is a martingale, for each $f \in \mathcal{D}(\mathcal{A})$ (see also Jacod and Shiryaev [27, Lemma III.2.48]).

Step 3 (Verification of the well-posedness condition). As $P$ is the $\{\mathcal{F}^Y_t\}$-optional projection
of $X$, it is indistinguishable from $G(Y)$ for some càdlàg process $G$ on $D_{\mathbb{Y}}[0,\infty)$ by Delzeith.
It follows from Equation (64) that $U$ is indistinguishable from $F(Y)$ for some càglàd process $F$
on $D_{\mathbb{Y}}[0,\infty)$. The martingale problem $(\mathcal{A},F)$ is well-posed by Assumption 8 because $F$ is a step
process. Thus, the well-posedness condition of Definition 2 is satisfied.

Step 4 (Value of the control with partial observations). The process $P$ defined in Step 2 is the
$\{\mathcal{F}^Y_t\}$-optional projection of $X$. By Lemma 1, $(U,P,Y)$ defines a separated control of the same
value as $(U,X,Y)$. Assumption 9 implies that $(U,P,Y)$ is equal in law to $(\tilde{U},\tilde{P},\tilde{Y})$. Therefore,
$(U,X,Y)$ has the same value as $(\tilde{U},\tilde{P},\tilde{Y})$. □


E Proofs of Section 4


The setup of Section 4.1, including Assumptions 1–12, holds.

Lemma 4 (Payoff function). For any control $(U,X,H,R)$ of the problem with partial observations,

\[ \mathbb{E}\left[\int_0^\infty \rho e^{-\rho t}\,dR_t\right] = \mathbb{E}\left[\int_0^\infty \rho e^{-\rho t}\,b(U_t,X,H_t)\,dt\right], \tag{66} \]

where $b$ is given by Assumption 11.


Proof. By the integrability condition on $K_R$ in Assumption 10, the process $R$ is a special
semimartingale. Its canonical decomposition is

\[ R = R_0 + b(U,X,H) \bullet I + R^c + r * (\mu - \nu), \tag{67} \]

where $\mu$ is the integer-valued random measure associated to the jumps of $R$ and $\nu = 1_{U=1} K_R(X,H,\cdot)$
is the compensator of $\mu$. For $\zeta_t = \rho e^{-\rho t}$ one obtains that

\[ \zeta \bullet R - \zeta\,b(U,X,H) \bullet I = \zeta \bullet R^c + (\zeta r) * (\mu - \nu) \tag{68} \]

is a local martingale. Equation (66) holds if it is a true martingale.

Let $\chi_R(r) = \chi(0,r)$. The processes $\zeta \bullet R^c$ and $\zeta\chi_R(r) * (\mu - \nu)$ are square integrable martingales
by the Burkholder-Davis-Gundy inequality because their quadratic variations are integrable:

\[ \mathbb{E}\big[[\zeta \bullet R^c]_\infty\big] = \mathbb{E}\big[\zeta^2\,1_{U=1}\,\sigma_R(H)^2 \bullet I_\infty\big] < \infty, \]
\[ \mathbb{E}\big[[\zeta\chi_R(r) * (\mu - \nu)]_\infty\big] = \mathbb{E}\big[\zeta^2\chi_R(r)^2 * \mu_\infty\big] = \mathbb{E}\big[\zeta^2\chi_R(r)^2 * \nu_\infty\big] < \infty. \tag{69} \]

This follows from the bounds on $\sigma_R$ and $K_R$ in Assumptions 2 and 10. Furthermore, the process
$\zeta(r - \chi_R(r)) * (\mu - \nu)$ is a uniformly integrable martingale because it is of integrable variation:

\[ \mathbb{E}\big[\operatorname{Var}\big(\zeta(r - \chi_R(r)) * (\mu - \nu)\big)_\infty\big] \le \mathbb{E}\big[\zeta|r - \chi_R(r)| * \mu_\infty\big] + \mathbb{E}\big[\zeta|r - \chi_R(r)| * \nu_\infty\big]
 = 2\,\mathbb{E}\big[\zeta|r - \chi_R(r)| * \nu_\infty\big] < \infty. \tag{70} \]

This follows from the bound on $K_R$ in Assumption 10. Therefore, the process in Equation (68) is
a martingale, and Equation (66) holds. □


Lemma 5 (Elimination of the state variable $r$). The value functions $V(p,h,r)$, $V^\delta(p,h,r)$ do not
depend on $r$ and can be written as $V(p,h)$, $V^\delta(p,h)$.


Proof. For any $s \in \mathbb{R}$ and $f \in \mathcal{D}(\mathcal{A})$, let $f_s(x,h,r) = f(x,h,r+s)$. Then $f_s \in \mathcal{D}(\mathcal{A})$, and
by Assumption 10, $\mathcal{A}f(u,x,h,r+s) = \mathcal{A}f_s(u,x,h,r)$. If $(U,X,H,R)$ is a control with partial
observations, then the equation

\[ f(X,H,R+s) - f(X,H_0,R_0+s) - \mathcal{A}f(U,X,H,R+s) \bullet I
 = f_s(X,H,R) - f_s(X,H_0,R_0) - \mathcal{A}f_s(U,X,H,R) \bullet I \tag{71} \]

shows that $(U,X,H,R+s)$ is also a control with partial observations. Moreover, the two controls
have the same value. The same argumentation applies to separated controls. □

Lemma 6. The value functions $V(p,h)$ and $V^\delta(p,h)$ are convex, non-decreasing in $p$, and non-decreasing in $h$.

Proof. Recall from Step 5 in the proof of Lemma 2 that the value of any separated control can be written as

$$J^{\mathrm{se}}(U,P,H,R) = \int_{L_{\mathbb{U}}[0,\infty)} J(U)\, \mathbb{P}(dU), \tag{72}$$

where $\mathbb{P}(dU)$ is the marginal distribution of $U \in L_{\mathbb{U}}[0,\infty)$ and $J(U)$ is the value of a deterministic control process $U$. In the definition of $J(U)$ in Equation (50), $P$ is a martingale and $(U,H)$ are deterministic. Therefore,

$$\begin{aligned}
J(U) &= \mathbb{E}\left[\int_0^\infty \rho e^{-\rho t} b(U_t,P_t,H_t)\,dt\right] \\
&= P_0 \int_0^\infty \rho e^{-\rho t} b(U_t,1,H_t)\,dt + (1 - P_0) \int_0^\infty \rho e^{-\rho t} b(U_t,0,H_t)\,dt.
\end{aligned} \tag{73}$$

This expression is linear in $P_0$ and non-decreasing in $(P_0,H_0)$ by [Assumption ]. Taking the supremum over all controls or step controls with fixed initial condition $(P_0,H_0)$, one obtains convexity in $P_0$ and monotonicity in $(P_0,H_0)$. □
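The mechanism behind Lemma 6 can be checked numerically on a toy example. The sketch below is not the model of the paper: the payoff `b`, the dynamics `next_h`, the parameters, and the truncation to `N` stages are all hypothetical choices made for illustration, and $H$ is frozen within each stage. It only illustrates that a supremum of functions which are linear in the prior and non-decreasing in $(p_0, h_0)$ is convex in $p_0$ and non-decreasing in both arguments.

```python
import itertools
import math

RHO = 0.5    # discount rate (assumed)
DELTA = 0.5  # length of one stage (assumed)
K = 0.4      # flow payoff of the safe arm (assumed)
N = 6        # number of stages kept in this truncated toy problem (assumed)

def b(u, x, h):
    """Hypothetical flow payoff: the safe arm pays K, the risky arm pays x * h."""
    return K if u == 0 else x * h

def next_h(u, h):
    """Hypothetical dynamics: H rises under the risky arm and falls under the safe arm."""
    return min(1.0, h + 0.1) if u == 1 else max(0.0, h - 0.1)

def J(controls, p0, h0):
    """Value of a deterministic control, split by hidden state as in Equation (73)."""
    total = {0: 0.0, 1: 0.0}
    h = h0
    for i, u in enumerate(controls):
        # Weight of stage i: integral of rho*exp(-rho*t) over [i*DELTA, (i+1)*DELTA].
        w = math.exp(-RHO * i * DELTA) * (1.0 - math.exp(-RHO * DELTA))
        for x in (0, 1):
            total[x] += w * b(u, x, h)
        h = next_h(u, h)
    return p0 * total[1] + (1.0 - p0) * total[0]

def V(p0, h0):
    """Supremum of J over all deterministic 0/1 control sequences of length N."""
    return max(J(c, p0, h0) for c in itertools.product((0, 1), repeat=N))
```

Evaluating `V` on a grid of priors, for example `[V(i / 10, 0.5) for i in range(11)]`, gives a sequence that is non-decreasing and satisfies the midpoint convexity inequality, as the lemma predicts for this toy setting.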

Lemma 7 (Sufficient condition for optimality of the risky arm). In the discretized separated problem, the risky arm is uniquely optimal as an initial choice if its expected first-stage payoff exceeds the first-stage payoff of the safe arm.

Proof. We fix $\delta > 0$ and only allow control processes which are piecewise constant on the uniform time grid of step size $\delta$. The expected first-stage payoff is denoted by

$$b^\delta(u,p,h) = \mathbb{E}\left[\int_0^\delta \rho e^{-\rho t} b(u,P_t,H_t)\,dt\right], \tag{74}$$

where $(P,H)$ stems from a separated control with initial condition $(P_0,H_0) = (p,h)$ and constant control process $U_t \equiv u$. By Bellman's principle, optimal initial choices $U_0$ for the discretized separated problem are maximizers of

$$\max_{u \in \mathbb{U}} \Big\{ b^\delta(u,p,h) + e^{-\rho\delta}\, \mathbb{E}\big[V^\delta(P_\delta,H_\delta) \,\big|\, (U_0,P_0,H_0) = (u,p,h)\big] \Big\}. \tag{75}$$

Thus, the optimal initial choice depends on the sign of the quantity

$$\begin{aligned}
& b^\delta(1,p,h) - b^\delta(0,p,h) + e^{-\rho\delta}\, \mathbb{E}\big[V^\delta(P_\delta,H_\delta) \,\big|\, (U_0,P_0,H_0) = (1,p,h)\big] \\
&\quad - e^{-\rho\delta}\, \mathbb{E}\big[V^\delta(P_\delta,H_\delta) \,\big|\, (U_0,P_0,H_0) = (0,p,h)\big],
\end{aligned} \tag{76}$$

which is the advantage of the risky arm over the safe arm. For each $u \in \mathbb{U}$, let $h_u$ be the deterministic value which $H_\delta$ attains after an initial choice of $u$. By Assumption 12, the inequality $h_0 \leq h \leq h_1$ holds. Furthermore, $P_\delta = P_0$ holds under an initial choice $u = 0$. By the monotonicity and convexity result of Lemma 6,

$$\begin{aligned}
& \mathbb{E}\big[V^\delta(P_\delta,H_\delta) \,\big|\, (U_0,P_0,H_0) = (1,p,h)\big] - \mathbb{E}\big[V^\delta(P_\delta,H_\delta) \,\big|\, (U_0,P_0,H_0) = (0,p,h)\big] \\
&\quad = \mathbb{E}\big[V^\delta(P_\delta,h_1) \,\big|\, (U_0,P_0,H_0) = (1,p,h)\big] - V^\delta(p,h_0) \\
&\quad \geq \mathbb{E}\big[V^\delta(P_\delta,h_0) \,\big|\, (U_0,P_0,H_0) = (1,p,h)\big] - V^\delta(p,h_0) \geq 0.
\end{aligned} \tag{77}$$

It follows that (76) is strictly positive if $b^\delta(1,p,h) > b^\delta(0,p,h)$. In this case, the initial choice of the risky arm is uniquely optimal. □
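The statement of Lemma 7 can be illustrated by backward induction on a small toy problem. Everything specific in the sketch below is an assumption made for illustration and not part of the paper: beliefs are updated from a Bernoulli signal with success probabilities `LAM`, the state $H$ lives on the integer grid `0..H_MAX`, the risky first-stage payoff is `W * p * (h + 1) / 4`, and the horizon is a finite number of stages `n`. The quantity `advantage` plays the role of expression (76).

```python
import math
from functools import lru_cache

RHO, DELTA, K = 0.5, 0.5, 0.4      # assumed discount rate, step size, safe payoff
W = 1.0 - math.exp(-RHO * DELTA)   # stage weight: integral of rho*exp(-rho*t) over [0, DELTA]
DISC = math.exp(-RHO * DELTA)      # one-stage discount factor
LAM = {1: 0.7, 0: 0.3}             # assumed success probability of a risky-arm signal given X
H_MAX = 5                          # H takes values on the integer grid 0..H_MAX (assumed)

def stage_payoff(u, p, h):
    """Expected first-stage payoff: W*K for the safe arm, W*p*(h+1)/4 for the risky arm."""
    return W * K if u == 0 else W * p * (h + 1) / 4.0

def bayes(p, success):
    """Posterior belief after observing one risky-arm signal."""
    l1 = LAM[1] if success else 1.0 - LAM[1]
    l0 = LAM[0] if success else 1.0 - LAM[0]
    return p * l1 / (p * l1 + (1.0 - p) * l0)

@lru_cache(maxsize=None)
def value(p, h, n):
    """Value of the toy problem with n stages left."""
    if n == 0:
        return 0.0
    return max(q_value(0, p, h, n), q_value(1, p, h, n))

def q_value(u, p, h, n):
    """Payoff of choosing u now and continuing optimally for the remaining n - 1 stages."""
    if u == 0:
        # Safe arm: the belief is unchanged and H decreases.
        cont = value(p, max(h - 1, 0), n - 1)
    else:
        # Risky arm: the belief follows Bayes' rule and H increases.
        s = p * LAM[1] + (1.0 - p) * LAM[0]
        h1 = min(h + 1, H_MAX)
        cont = s * value(bayes(p, True), h1, n - 1) + (1.0 - s) * value(bayes(p, False), h1, n - 1)
    return stage_payoff(u, p, h) + DISC * cont

def advantage(p, h, n):
    """Toy analogue of expression (76): advantage of the risky arm over the safe arm."""
    return q_value(1, p, h, n) - q_value(0, p, h, n)
```

In this toy setting, `advantage(p, h, n)` is positive at every grid point where `stage_payoff(1, p, h)` exceeds `stage_payoff(0, p, h)`, for example at `p = 0.9, h = 3`. The converse fails in general, since the risky arm can be optimal for its informational value even when its first-stage payoff is lower.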

Lemma 8 (Optimality of stopping rules). For each $\delta > 0$, $V^\delta(p,h)$ is a supremum over values of stopping rules.


Proof. Step 1 (Discrete setting). We fix $\delta > 0$ and work on the uniform time grid $t_i = i\delta$, $i \in \mathbb{N}$. The one-stage payoff of the problem with partial observations is given by

$$b^\delta(u,x,h) = \int_0^\delta \rho e^{-\rho t} b(u,x,H_t)\,dt, \tag{78}$$

where $H$ stems from a control with partial observations with constant control process $U_t \equiv u$ and initial condition $(X,H_0) = (x,h)$. The one-stage payoff of the safe arm is

$$k^\delta = b^\delta(0,x,h) = \int_0^\delta \rho e^{-\rho t} k\,dt. \tag{79}$$

By abuse of notation, we identify indices $i \in \mathbb{N}$ with times $t_i$, writing $U_i$ for the value of $U$ on $(t_i,t_{i+1}]$ and $(P_i,H_i,R_i)$ for the value of $(P,H,R)$ at $t_i$.
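As a quick sanity check of the discretization, the integral in Equation (79) has the closed form $k^\delta = k(1 - e^{-\rho\delta})$, and discounting $n$ consecutive safe stages by $e^{-\rho t_i}$ telescopes to $k(1 - e^{-\rho n \delta})$. The helper names and parameter values in the sketch below are hypothetical and only serve to verify this arithmetic.

```python
import math

def safe_stage_payoff(k, rho, delta):
    """k^delta: integral of rho*exp(-rho*t)*k over [0, delta], evaluated in closed form."""
    return k * (1.0 - math.exp(-rho * delta))

def always_safe_value(k, rho, delta, n):
    """Discounted sum over n safe stages: sum over i < n of exp(-rho*i*delta) * k^delta."""
    stage = safe_stage_payoff(k, rho, delta)
    return sum(math.exp(-rho * i * delta) * stage for i in range(n))
```

For instance, `always_safe_value(0.4, 0.5, 0.25, 40)` agrees with `0.4 * (1 - math.exp(-5.0))` up to rounding, and the sum approaches $k$ as the number of stages grows.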

Step 2 (Finite horizon). We truncate the problem with partial observations to a finite time horizon $n$. In the truncated problem, the value of a control $(U,X,H,R)$ is given by

$$J^{\delta,n}(U,X,H,R) = \mathbb{E}\left[\sum_{i=0}^{n-1} e^{-\rho t_i}\, b^\delta(U_i,X,H_i)\right]. \tag{80}$$


We will show by induction on $n$ that there exists an optimal stopping rule, i.e., a control that never switches from the safe to the risky arm. For $n = 0$, there is nothing to prove. Now let $(U,X,H,R)$ be an optimal control for the problem with horizon $n+1$ constructed via Lemma 3 from an optimal Markovian control for the truncated separated problem. As $H$ evolves deterministically given $U$,* it is possible to write $U = F(R)$ for a piecewise constant process $F$ on the path space $D_{\mathbb{R}}[0,\infty)$. The inductive hypothesis allows one to assume that for $i \geq 1$, $U_i$ never switches from the safe to the risky arm. If $U_0$ indicates the risky arm, the proof is complete. Otherwise, $U$ has the form

$$U_i = \begin{cases} 0, & \text{if } i = 0 \text{ or } i > T, \\ 1, & \text{if } 1 \leq i \leq T, \end{cases} \tag{81}$$

for some stopping time $T$. Given that the safe arm is chosen initially, the reward process $R$ is deterministic during the first stage. Therefore, there is a modification of $F$ that does not depend on the path of $R$ on the interval $[0,\delta]$. This makes it possible to define an adapted process $F^*$ which skips the first action of $F$. Then $F^*$ is a stopping rule. Formally, $F^*$ can be defined as

$$F^*(R) = S_\delta F(S_{-\delta} R), \tag{82}$$

where for any process $Z$, $(S_\delta Z)_t = Z_{t+\delta}$ is a shift of $Z$ by $\delta$. As the martingale problem $(A,F^*)$ is well-posed, there is a corresponding control $(U^*,X^*,H^*,R^*)$ with $U^* = F^*(R^*)$ and initial condition $\mathbb{E}[X^*] = \mathbb{E}[X]$, $H^*_0 = H_0$. For comparison, we also define $(U^0,X^0,H^0,R^0)$ as the control where the safe arm $U^0 \equiv 0$ is chosen all the time, still with the same initial condition $\mathbb{E}[X^0] = \mathbb{E}[X]$, $H^0_0 = H_0$. The values of the controls are denoted by $J$, $J^*$, and $J^0$, respectively. Then

* This is easily seen for $U_0$, which is deterministic. For $U_{i+1}$, it follows by induction because $H_{i+1}$ is a deterministic function of $U_i$.


J*-J° 

J- 


/T-l 


E >0. 


(83) 

(84) 


The first inequality holds because choosing the safe arm decreases $H$, see Assumption 12. The second inequality holds because the value $J$ of the optimal control is at least as high as $J^0$. Thus,


J* - J> E 


1 

{b\l,X,Hi) - k^) 

OO 

= E li<T{b\l,X,Hi) - k^) . (85) 


_2 = 1 


1=1 


= \bi 


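The increments of $(b_i)_{i \in \mathbb{N}}$ are given by

$$b_{i+1} - b_i = \mathbb{E}\big[1_{i \leq T}\big(b^\delta(1,X,H_{i+1}) - b^\delta(1,X,H_i)\big)\big] + \mathbb{E}\big[1_{i = T}\big(k^\delta - b^\delta(1,X,H_{i+1})\big)\big]. \tag{86}$$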


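The first summand on the right-hand side is non-negative for $i \geq 1$ because $H$ increases while the risky arm is played. By the $\mathcal{F}_{t_{i+1}}$-measurability of $1_{i=T}$ and $H_{i+1}$, the second summand can be written as

$$\begin{aligned}
\mathbb{E}\big[1_{i=T}\big(k^\delta - b^\delta(1,X,H_{i+1})\big)\big]
&= \mathbb{E}\big[1_{i=T}\big(k^\delta - \mathbb{E}[b^\delta(1,X,H_{i+1}) \mid \mathcal{F}_{t_{i+1}}]\big)\big] \\
&= \mathbb{E}\big[1_{i=T}\big(k^\delta - b^\delta(1,P_{T+1},H_{T+1})\big)\big],
\end{aligned} \tag{87}$$

where $b^\delta(u,p,h)$ is defined in Equation (74). As it is optimal under $U$ (see Equation (81)) to choose the safe arm at stage $T+1$, the inequality $k^\delta \geq b^\delta(1,P_{T+1},H_{T+1})$ holds by Lemma 7. This proves $b_{i+1} \geq b_i$ for all $i \geq 1$. By Equation (84) we also have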


^ > 0 . 

i=l 


By Berry and Fristedt [3, Equation (5.2.8)] this implies


( 88 ) 


J*-J = Y^ > 0 ^ ( 89 ) 

i=l 


since truncated geometric discount sequences are regular. Thus we have constructed an optimal stopping rule $(U^*,X^*,H^*,R^*)$ for the truncated problem with horizon $n+1$.

Step 3 (Infinite horizon). We have shown that stopping rules are optimal for each discretized problem with finite horizon $n$. It follows by approximation that the value function $V^\delta(p,h)$ of the discretized problem with infinite horizon is a supremum over stopping rules. The argument can be found in the proof of Berry and Fristedt [3, Theorem 5.2.2]. □

Lemma 9 (Description of optimal stopping rules). The stopping time $T^* = \inf\{t : V(P_t,H_t) \leq k\}$ is optimal for the separated problem.


Proof. For each $(p,h) \in [0,1] \times \mathbb{H}$, there is a unique solution $(P,H,R)$ of the martingale problem for $(G,1)$ by Assumption 9. The family $(P,H)$ of processes, indexed by the initial condition $(p,h)$, is a Feller process. This follows from Jacod and Shiryaev [27, Theorem IX.4.39] using similar arguments as in Step 3 of the proof of Lemma 2. Let $(\tilde P,\tilde H)$ be the killed version of $(P,H)$ with killing rate $\rho$ and let $\partial$ denote the "cemetery point" of the killed process. We refer to Peskir and Shiryaev [58, Section II.5.4] for the terminology. Let $b(u,\partial) = 0$ and

$$A_t = A_0 + \int_0^t \big(b(1,\tilde P_s,\tilde H_s) - k\big)\,ds. \tag{90}$$

Then $Z = (\tilde P,\tilde H,A)$ is a Feller process on the state space $\mathcal{Z} = ([0,1] \times \mathbb{H} \cup \{\partial\}) \times \mathbb{R}$. Let $(\mathbb{P}_z)_{z \in \mathcal{Z}}$ denote the family of laws of $Z$ starting from the initial condition $Z_0 = z$. There is an associated family of stopping problems

$$W(z) = \sup_T \mathbb{E}_z[A_T], \tag{91}$$

where the supremum is taken over all $(\mathcal{F}^Z_t)$-stopping times. For any $z = (p,h,a)$ with $(p,h) \neq \partial$,

$$\begin{aligned}
W(z) &= \sup_T \mathbb{E}_{(p,h,a)}[A_T] = \sup_T \mathbb{E}_{(p,h,0)}[A_T] + a \\
&= \sup_T \mathbb{E}\left[\int_0^T \rho e^{-\rho t}\big(b(1,P_t,H_t) - k\big)\,dt\right] + a = V(p,h) - k + a,
\end{aligned} \tag{92}$$


because $V(p,h)$ is a supremum of values of stopping rules by part (a) of Theorem 2. The stopping set $D \subset \mathcal{Z}$ is defined as in Peskir and Shiryaev [58, Equation (2.2.5)] by

$$D = \{z = (p,h,a) \in \mathcal{Z} : W(z) \leq a\} = \big(\{(p,h) \in [0,1] \times \mathbb{H} : V(p,h) \leq k\} \cup \{\partial\}\big) \times \mathbb{R}. \tag{93}$$

The last equality holds because $W(\partial,a) = a$ by definition. The function $W$ is lower semi-continuous by Peskir and Shiryaev [58, Equation (2.2.80)] because $(\tilde P,\tilde H,A)$ is Feller. Therefore, the set $D$ is closed. Then the right-continuity of the filtration implies that

$$T^* = \inf\{t \geq 0 : Z_t \in D\} = \inf\{t : V(P_t,H_t) \leq k\} \tag{94}$$

is a stopping time. Note that $\partial \in D$, which implies $\mathbb{P}(T^* < \infty) = 1$. Then Peskir and Shiryaev [58, Corollary 2.9] implies that $T^*$ is optimal. □


Lemma 10 (Asymptotic learning). Assume $0 < P_0 < 1$. Then the following statements hold for any control $(U,X,H,R)$ of the problem with partial observations and the corresponding belief process $P$.

(a) Assume that the measures $K_R(1,h,\cdot)$ and $K_R(0,h,\cdot)$ are equivalent for all $h$. Then learning in finite time is impossible, i.e., $0 < P_t < 1$ holds a.s. for all $t \geq 0$. Moreover, asymptotic learning does not occur if the agent invests only a finite amount of time into the risky arm, i.e.,

$$\left\{\int_0^\infty U_t\,dt < \infty\right\} \subseteq \{0 < P_\infty < 1\} \quad \mathbb{P}\text{-a.s.} \tag{95}$$

(b) Assume that $\Phi(1,\cdot)$ is bounded from below by a positive constant. Then asymptotic learning is guaranteed if the agent invests an infinite amount of time in the risky arm, i.e.,

$$\left\{\int_0^\infty U_t\,dt = \infty\right\} \subseteq \{P_\infty = X\} \quad \mathbb{P}\text{-a.s.} \tag{96}$$

(c) If the conditions of (a) and (b) are satisfied, then asymptotic learning occurs if and only if the agent invests an infinite amount of time in the risky arm:

$$\left\{\int_0^\infty U_t\,dt = \infty\right\} = \{P_\infty = X\} \quad \mathbb{P}\text{-a.s.} \tag{97}$$


Proof. Step 1 (Hellinger process). Let $\mathbb{P}_1$ and $\mathbb{P}_0$ be defined by conditioning the measure $\mathbb{P}$ on the events $X = 1$ and $X = 0$, respectively. We want to calculate the Hellinger process $h(\tfrac12)$ of order $\tfrac12$ of the measures $\mathbb{P}_1$ and $\mathbb{P}_0$. Let $P_t = \mathbb{E}[X \mid \mathcal{F}_t]$ be the belief process. By Equation (7), $P/P_0$ is the density process of $\mathbb{P}_1$ relative to $\mathbb{P}$. Similarly, $(1-P)/(1-P_0)$ is the density process of $\mathbb{P}_0$ relative to $\mathbb{P}$. For all $p, q \in \mathbb{R}$, let

(98) 




and let $\nu(dt,dp,dq)$ be the compensator of the integer-valued random measure associated to the jumps of $(P, 1-P)$. Let $S$ be the first time that $P$ or $P_-$ hits zero or one,

$$S = \inf\{t \geq 0 : P_t \in \{0,1\} \text{ or } P_{t-} \in \{0,1\}\}. \tag{99}$$

By Jacod and Shiryaev [27, Lemma III.3.7], $P$ is constant on $[S,\infty[$. Therefore, on this interval, $\langle P^c,P^c\rangle$ is constant and $\nu$ has no charge. After canceling out the terms $P_0$ and $(1-P_0)$, the formula for $h(\tfrac12)$ given in Jacod and Shiryaev [27, Theorem IV.1.33] reads as


1 / 1 




p_(l-p_) 


(P", 1 - P") + 


(l-p_) 


(1 - p^ 1 - p^ 


+ V' (1 1 - p_ 


* iy{dt, dp, d(7) 


1/1 1 

+ 


{PZPZ 

'-, 1 - 


p_ l-p_ 

1 j{u,p_,Y_,z) 


1 


p_ ’ 1 - p_ 

= ^<^i(p,v)^u2(p,y),^i(p,y)./^ 

MU.Y,z) 


*K{U,P-,Y_,dz)dt 


+ V’ 


p_02(p, y-, z) + (1 - P-)(2 - MU, Y.,z)) 


-^ ^ MU,Y z) -^^ ) 1 | 0 si*KiU,P-,Y.,dz)dt 

P.MU,Y_,z) + {1-P.){2-MU,Y.,z)) i 


1 


= ^<^i(p,y)^u2(p,y),^i(p,y).p 


+ 


i-J mu,y,z){2-mu,y,z)) _ 

^ -K{U, P, y, dz) • 


PMU, y, z) + (1 - P) (2 - MU, Y, z)) 

= ^MU,Y)'^a\U,Y)MU,Y)*I^ 

+ y (i - ^Jmu,y,z){2-mu,y,z))^ k{u, 1/2,y,dz ).= $(P,y). i- 

where $\Phi$ is defined in Assumption 5.


( 100 ) 


Step 2 (Finite investment prevents asymptotic learning). We define stopping times $T$ and $\tau_n$ as in Equations (34) and (15). $T$ is the first time that $P$ or $P_-$ hits zero and $\tau_n$ announces $T$. Let us assume for contradiction that $P$ jumps to zero, i.e., $P_{T-} > 0$. Then $\tau_n = T$ holds for all sufficiently large $n$. Consequently, the process $\mathcal{E}(L^n) = P^{\tau_n}/P_0$ defined in Equation (17) also jumps to zero. Therefore, $L^n$ has a jump of height $-1$. This is not possible because $\phi_2(u,y,z) > 0$ holds by the assumption that $K(u,1,y,\cdot)$ and $K(u,0,y,\cdot)$ are equivalent. This proves that $P$ does not jump to zero. A similar argument where the roles of $\mathbb{P}_0$ and $\mathbb{P}_1$ are reversed shows that $P$ cannot jump to one. It follows that for any stopping time $\tau$,

$$\{h(\tfrac12)_\tau = \infty\} = \{S \leq \tau,\ P_{S-} = 0\} = \{P_\tau = 0\} = \{P_\tau = 0 \text{ or } P_\tau = 1\} \quad \mathbb{P}_0\text{-a.s.}, \tag{101}$$

$$\{h(\tfrac12)_\tau = \infty\} = \{S \leq \tau,\ P_{S-} = 1\} = \{P_\tau = 1\} = \{P_\tau = 0 \text{ or } P_\tau = 1\} \quad \mathbb{P}_1\text{-a.s.} \tag{102}$$

In Equations (101) and (102), the first equality holds by Schachermayer and Schachinger [67, Theorem 1.5]. This theorem states that the divergence of the Hellinger process is equivalent to the mutual singularity of the measures $\mathbb{P}_1$ and $\mathbb{P}_0$, but in such a way that the singularity is not obtained by a sudden jump of the density process to zero or one. The second equality holds because such jumps are not possible by the previous claim. For the third equality, see Jacod and Shiryaev [27, Proposition III.3.5.(ii)]. By Assumption 10, the safe arm reveals no information about the hidden state $X$, resulting in $\Phi(0,y) = 0$. Together with Assumption 5 bounding $\Phi$ from above, Equations (101) and (102) imply

$$\left\{\int_0^\infty U_t\,dt < \infty\right\} \subseteq \{h(\tfrac12)_\infty < \infty\} = \{0 < P_\infty < 1\} \quad \mathbb{P}\text{-a.s.} \tag{103}$$

This proves (a).

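Step 3 (Infinite investment induces asymptotic learning). Let $\tau$ be a stopping time. If $S$ does not occur before $\tau$ and $\int_0^\tau U_t\,dt = \infty$, then $h(\tfrac12)_\tau = \infty$ because of the lower bound $\inf_y \Phi(1,y) > 0$. Therefore,

$$\left\{\int_0^\tau U_t\,dt = \infty\right\} \subseteq \{h(\tfrac12)_\tau = \infty\} \cup \{S < \tau\}. \tag{104}$$

Moreover, it follows from Schachermayer and Schachinger [67, Theorem 1.5] that

$$\{h(\tfrac12)_\tau = \infty\} \cup \{S < \tau\} = \{S \leq \tau,\ P_{S-} = 0\} \cup \{S < \tau\} = \{P_\tau = X\} \quad \mathbb{P}_0\text{-a.s.}, \tag{105}$$

$$\{h(\tfrac12)_\tau = \infty\} \cup \{S < \tau\} = \{S \leq \tau,\ P_{S-} = 1\} \cup \{S < \tau\} = \{P_\tau = X\} \quad \mathbb{P}_1\text{-a.s.} \tag{106}$$

It follows that

$$\left\{\int_0^\tau U_t\,dt = \infty\right\} \subseteq \{P_\tau = X\} \quad \mathbb{P}\text{-a.s.}, \tag{107}$$

which proves (b). Finally, (c) follows from (a) and (b).

□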
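The dichotomy in Lemma 10 can be illustrated with a small simulation in discrete time. The sketch below is a toy analogue and not the continuous-time model of the paper: each unit of time invested in the risky arm is assumed to produce one Bernoulli signal with the hypothetical success probabilities `LAM`, which are chosen so that the two signal laws are equivalent, and the safe arm produces no signal. Beliefs are tracked in log-odds to avoid rounding to exactly zero or one.

```python
import math
import random

LAM = {1: 0.7, 0: 0.3}            # assumed success probabilities of a risky-arm signal given X
STEP = math.log(LAM[1] / LAM[0])  # log-likelihood ratio of a success; a failure contributes -STEP

def simulate_log_odds(x, p0, n_risky, seed=0):
    """Log-odds of the belief after n_risky risky-arm signals when the hidden state is x."""
    rng = random.Random(seed)
    log_odds = math.log(p0 / (1.0 - p0))
    for _ in range(n_risky):
        success = rng.random() < LAM[x]
        log_odds += STEP if success else -STEP
    return log_odds

def belief(log_odds):
    """Convert log-odds to a probability, written to avoid overflow for large |log_odds|."""
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)
```

With finitely many risky signals, the log-odds move by at most `n_risky * STEP`, so the belief stays strictly inside $(0,1)$, in line with part (a). With many signals, for example `simulate_log_odds(1, 0.5, 2000)`, the log-odds drift towards $+\infty$ when $X = 1$ and towards $-\infty$ when $X = 0$, in line with part (b).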